Archive for April, 2009

Indexing of Saskatchewan Well Bulletins

With the marketing effort well underway, it was time to pay some attention to building up the body of documentation available for the Onramp search engine.  Next on the list were the Well Bulletins, Saskatchewan’s equivalent of the ST1.  The first job was to download a copy of each of the reports and upload it to Intellog’s S3 infrastructure.  All reports were uploaded, right back to 2003-01-02, the first day they were made available.  (Note: Well Bulletins are not published on weekends or holidays.)
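
Since bulletins only appear on working days, the list of dates worth requesting can be worked out ahead of the download.  What follows is a minimal sketch in Python, not the method actually used here: the function name candidate_bulletin_dates and the holidays argument are hypothetical, and the set of Saskatchewan holidays would have to be supplied by hand.

```python
from datetime import date, timedelta


def candidate_bulletin_dates(start, end, holidays=frozenset()):
    """Yield each weekday from start to end inclusive, skipping listed holidays.

    holidays is an assumed, caller-supplied set of date objects; nothing
    here knows the real Saskatchewan holiday calendar.
    """
    day = start
    while day <= end:
        # weekday() is 0 for Monday through 6 for Sunday
        if day.weekday() < 5 and day not in holidays:
            yield day
        day += timedelta(days=1)


if __name__ == "__main__":
    # The first week bulletins were available, starting 2003-01-02.
    for d in candidate_bulletin_dates(date(2003, 1, 2), date(2003, 1, 8)):
        print(d.isoformat())
```

Each date produced this way is only a candidate; a missing report on a given day most likely means a holiday that was not in the supplied set.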

But of course, the real objective was to make the Well Bulletins searchable with Onramp.  To achieve this, the now more-or-less standard set of steps for the indexing of text files was followed:

  1. A copy of the files to be indexed was captured in a parallel directory hierarchy on a local workstation, and then the list of files was generated into a file called manifest.txt.  This was accomplished with the DOS (remember that?) command dir /s /b /on /a-d >manifest.txt.
  2. manifest.txt was imported into a temporary SQL Server table called dbo.manifest$, which was subsequently used to populate localPathTxt and urlTxt of base.Document.  Inserting instances into base.Document automatically assigns the globally unique identifiers, which makes the name of these documents unique in the known universe.
  3. The Excel spreadsheet fileSizeHack.xls was used to establish the size of each Well Bulletin file, and then this data was imported into the temporary table dbo.Filtered$.
  4. The data from dbo.manifest$ and dbo.Filtered$ was then combined, and the iSentence-compliant column in base.Document.xml was populated with sizeAmt, sourceLbl, typeLbl and udt.
  5. SQL was then used to generate wrapBatch.sh, which is a series of calls to a second shell script called wrap.sh.  The latter takes the parameters passed to it in each line of wrapBatch.sh, and generates a Solr-compliant XML file.  Each XML is stored in a separate file named using the globally unique identifier created when the reference to the document was inserted into base.Document, followed by the .xml extension.
  6. SQL was also used to generate index.sh, which issues a cURL statement to feed each of the wrap.sh-generated XML files to Solr, and then issues a commit statement when all else is done.
  7. These two files were then uploaded to the server.  wrapBatch.sh was executed, and the XML files generated.  These XML files were then downloaded back to the local workstation.  (For those who think this sounds a little bass ackwards, it’s simply because wrap.sh contains command syntax native to Linux — wrap.sh doesn’t currently work on Windows).
  8. index.sh was executed on the local workstation (using the temporary name index.bat), and the local version of Solr was populated.  A few modifications to wrap.sh were required to remove some control characters which crop up in the Well Bulletin files.  These modifications necessitated a couple of iterations of this step and the previous one.
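
The manifest, wrapping and control-character steps above can be sketched in Python, which would also run on Windows.  This is a minimal sketch under stated assumptions, not a copy of wrap.sh: the function names are hypothetical, and so are the Solr field names id, url and text, since the actual schema is not shown here.

```python
import os
import re
import uuid
from xml.sax.saxutils import escape

# Control characters that XML 1.0 does not allow (tab, LF and CR are fine).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def build_manifest(root):
    """Return the sorted full path of every file under root.

    Roughly what dir /s /b /on /a-d produces: files only, full paths.
    """
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def strip_control_chars(text):
    """Remove the control characters that would make the XML invalid."""
    return _CONTROL_CHARS.sub("", text)


def wrap(doc_id, url, text):
    """Build a Solr <add> message for one document.

    The field names are assumptions made for illustration only.
    """
    body = escape(strip_control_chars(text))
    return (
        "<add><doc>"
        f'<field name="id">{escape(doc_id)}</field>'
        f'<field name="url">{escape(url)}</field>'
        f'<field name="text">{body}</field>'
        "</doc></add>"
    )


if __name__ == "__main__":
    # A made-up bulletin fragment containing a form feed and an ampersand.
    sample = "WELL BULLETIN\x0c 2009-04-24 A & B"
    print(wrap(str(uuid.uuid4()), "http://example.com/bulletin.txt", sample))
```

Here uuid.uuid4() merely stands in for the identifier that base.Document assigns; in the real process the identifier comes from the database, and each message is written to a file named after it.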

Once the test indexing was complete, a few modifications to /Onramp/xml/ApplicationDefinition.xml, /Onramp/xsl/outputSearchResult.xsl and /Onramp/php/Onramp.php were required to accommodate the new file type and its source.  Later in the evening, index.sh was executed on the production server, and the modified files were uploaded, which completed the process.  The only gotcha was Java running out of heap space, which was cured (?) by using java -Xmx512M -Xms512M -jar start.jar to increase the initial and maximum heap space available.  It may have to be increased still further in the future.

Code Shavings  Both the raw text (TXT) and the nicely delimited equivalent (CSV) files were downloaded from the SER website, and the original intention was to index both.  But the excerpt of the CSV files displayed on the results screen looked really rough.  So it was decided to drop the CSV from the index, and to eventually make the CSV file available as a separate link, immediately adjacent to the TXT, on the results page.  ♦  The SQL used to accomplish the steps above was captured in the sqlTtxt.txt files found in the working folders saskatchewan/2009/04/24 and 27.  ♦  Thanks to Thierry Collogne and Caucho for their assistance in resolving the Java heap space problem.

Posted on 28th April 2009
Under: Developers' Journal

Intellog Announces the ‘Onramp’ Search Engine featuring Fully Indexed ERCB Directives, License, Drilling and Pipeline Reports

Intellog has released its ‘Onramp’ search engine and kicked it off with the fully indexed text of all ERCB Directives, as well as the fully indexed text of every Well Licenses Issued (ST1), Drilling Activity (ST49) and Pipeline Approval & Disposition Daily List (ST96) reports back to 2001. It can be accessed immediately and without charge at http://www.intellog.com/onramp

Onramp is a next generation search engine.  Like Google® and Yahoo®, Onramp makes finding information easy; enter a few search terms to identify the target of your search and the information meeting your criteria is displayed. But Onramp improves on the state-of-the-art in two important ways: it tailors the tools and content specifically to the energy industry and is capable of indexing information right down to the individual page. “The mainstream search engines have set the standard for what people expect when searching for information. So that was our jumping off point,” said Terence Gannon, Intellog’s Founder and CEO, “however, we are focused on the energy industry and its special, unique requirements. We feel we are able to deliver a significantly improved user experience in this area.” Gannon went on to say “industry professionals are stretched to the limit – they absolutely demand tools like Onramp to save them time and give them better search results.”

Intellog is a Calgary-based company founded in 2008 specializing in industry-specific search technology and related applications. For further information, contact Terence Gannon [by leaving a comment below].  Click here for PDF version.

Posted on 20th April 2009
Under: Business Development, Press Releases

Onramp Added to Amazon Web Services Solutions Catalogue

Applications built using Amazon Web Services (AWS), like Intellog’s Onramp, have the opportunity to be listed in the AWS Solutions Catalogue.  With the recent release of the beta, a description of Onramp was submitted to AWS, and was published in the catalogue shortly thereafter.  Click here to take a look, and please feel free to provide comments or feedback below.

Posted on 17th April 2009
Under: Business Development