Indexing of Saskatchewan Well Bulletins
With the marketing effort well underway, it was time to pay some attention to building up the body of documentation available for the Onramp search engine. Next on the list were the Well Bulletins, Saskatchewan’s equivalent of the ST1. The first job was to download a copy of each of the reports and upload it to Intellog’s S3 infrastructure. All reports were uploaded, right back to 2003-01-02, the first day they were made available. (Note: Well Bulletins are not published on weekends or holidays.)
But of course, the real objective was to make the Well Bulletins searchable with Onramp. To achieve this, the now more-or-less standard set of steps for the indexing of text files was followed:
- A copy of the files to be indexed was captured in a parallel directory hierarchy on a local workstation, and then the list of files was generated into a file called manifest.txt. This was accomplished with the DOS (remember that?) command dir /s /b /on /a-d >manifest.txt.
- manifest.txt was imported into a temporary SQL Server table called dbo.manifest$, which was subsequently used to populate localPathTxt and urlTxt of base.Document. Inserting instances into base.Document automatically assigns the globally unique identifiers, which makes the names of these documents unique in the known universe.
- The Excel spreadsheet fileSizeHack.xls was used to establish the size of each Well Bulletin file, and then this data was imported into the temporary table dbo.Filtered$.
- The data from dbo.manifest$ and dbo.Filtered$ was then combined, and the iSentence-compliant column in base.Document.xml was populated with sizeAmt, sourceLbl, typeLbl and udt.
- SQL was then used to generate wrapBatch.sh, which is a series of calls to a second shell script called wrap.sh. The latter takes the parameters passed to it in each line of wrapBatch.sh, and generates a Solr-compliant XML file. Each XML is stored in a separate file named using the globally unique identifier created when the reference to the document was inserted into base.Document, followed by the .xml extension.
- SQL was also used to generate index.sh, which issues a cURL statement to feed each of the wrap.sh-generated XML files to Solr, and then issues a commit statement when all else is done.
- These two files were then uploaded to the server. wrapBatch.sh was executed, and the XML files generated. These XML files were then downloaded back to the local workstation. (For those who think this sounds a little bass ackwards, it’s simply because wrap.sh contains command syntax native to Linux — wrap.sh doesn’t currently work on Windows).
- index.sh was executed on the local workstation (using the temporary name index.bat), and the local version of Solr was populated. A few modifications to wrap.sh were required to remove some control characters which crop up in the Well Bulletin files. These modifications meant repeating this step and the previous one a couple of times.
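For readers who want to see the shape of the manifest and wrapping steps in one place, here is a minimal Python sketch. It is not the wrap.sh used above, whose contents are not shown in this post. It assumes that the manifest and the file sizes can be gathered in a single directory walk (standing in for the dir command and fileSizeHack.xls), that the generated files follow Solr's standard add/doc/field update format, and that the offending control characters are the ones XML 1.0 forbids. The function names and the field called text are hypothetical, and using id, localPathTxt and sizeAmt as Solr field names is also an assumption.

```python
import os
import re
from xml.sax.saxutils import escape

# XML 1.0 forbids control characters other than tab, LF and CR.
# Assumption: these are the characters that had to be removed.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def build_manifest(root):
    """Return sorted (path, size_in_bytes) pairs for every file under root."""
    entries = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            entries.append((path, os.path.getsize(path)))
    return sorted(entries)


def wrap(doc_id, path, size, text):
    """Wrap one document in Solr's <add><doc> update format."""
    clean = _CONTROL_CHARS.sub("", text)
    # Field names are illustrative; the real schema is not shown in the post.
    fields = [
        ("id", doc_id),
        ("localPathTxt", path),
        ("sizeAmt", str(size)),
        ("text", clean),
    ]
    body = "".join(
        '<field name="%s">%s</field>' % (name, escape(value))
        for name, value in fields
    )
    return "<add><doc>%s</doc></add>" % body
```

Each string returned by wrap() would then be written to a file named after the document's identifier plus the .xml extension, as described in the steps above.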
Once the test indexing was complete, a few modifications to /Onramp/xml/ApplicationDefinition.xml, /Onramp/xsl/outputSearchResult.xsl and /Onramp/php/Onramp.php were required to accommodate the new file type and its source. Later in the evening, index.sh was executed on the production server, and the modified files were uploaded, which completed the process. The only gotcha was Java running out of heap space, which was cured (?) by using java -Xmx512M -Xms512M -jar start.jar to increase the initial and maximum heap space available. It may have to be increased still further in the future.
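The indexing run itself (feed every generated XML file to Solr, then commit) can be sketched in Python as well. This is an illustration rather than the generated index.sh: the URL below is the default update endpoint of a stock Solr install and is an assumption about this setup, and build_update_requests and run are hypothetical names.

```python
import urllib.request

# Assumption: a stock Solr install listening on its default port.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_requests(xml_paths, url=SOLR_UPDATE_URL):
    """Build one POST per XML file, followed by a final commit request."""
    headers = {"Content-Type": "text/xml; charset=utf-8"}
    requests = []
    for path in xml_paths:
        with open(path, "rb") as handle:
            requests.append(
                urllib.request.Request(url, data=handle.read(), headers=headers)
            )
    # Nothing becomes searchable until Solr receives a commit.
    requests.append(
        urllib.request.Request(url, data=b"<commit/>", headers=headers)
    )
    return requests


def run(requests):
    """Send the requests in order; urlopen raises on an HTTP error status."""
    for request in requests:
        with urllib.request.urlopen(request) as response:
            response.read()
```

Building the requests separately from sending them makes it possible to inspect what would be posted before pointing the script at the production server.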
Code Shavings
Actually, both the raw text (TXT) and the nicely delimited equivalent (CSV) files were downloaded from the SER website, and the original intention was to index both. But the excerpt of the CSV files displayed on the results screen looked really rough. So it was decided to drop the CSV from the index, and to eventually make the CSV file available as a separate link, immediately adjacent to the TXT, on the results page.
♦ The SQL used to accomplish the steps above was captured in the sqlTtxt.txt files found in the working folders saskatchewan/2009/04/24 and 27.
♦ Thanks to Thierry Collogne and Caucho for their assistance in resolving the Java heap space problem.
Posted on 28th April 2009
Under: Developers' Journal