Archive for December, 2008

Implementing Hit Highlighting with Solr*

At the conclusion of the previous post, the next technical objective was identified: populate the description column of the result set with information that makes sense for the type of document shown in each row.  In the case of completion reports, it makes sense to show the well location and its verbose name.  It’s a little more problematic in the case of the ST1/ST49, however.  After some consideration, it was determined that a snippet of the report surrounding the search term(s) would be useful.  This will enable users to quickly scan down the result set and determine if the full report is worth a closer look.  In Solr parlance, this is referred to as hit highlighting, or as I refer to it from time to time below, snippets.  Getting this implemented was a matter of incrementally solving a series of small problems:

  • The Solr administrative interface enables the testing of hit highlighting by simply clicking a checkbox, but when I did this (whoops) no snippet appeared.  For highlighting to work, the snippet has to come from somewhere.  The entire text of each report is being indexed, so what gives?  Turns out, of course, there’s a huge difference between parsing/indexing the entire document, and actually being able to retrieve the original text.  In order to do the latter, you have to remember to set the stored attribute of the field element in schema.xml to true.  It seems odd to have to store the entire document twice, but disk space is cheap, and performance doesn’t seem to suffer.  That done, the snippets began to appear pretty much as expected.
  • When highlighting is enabled, the response XML produced by Solr simply gains a new lst element, identified by the name highlighting.  It repeats the unique identifiers found in the original result element, and associates each with the snippet, enclosed in a str element.  While it would have made more sense to me to embed the str element within result, merging the str element with the data found in result/doc can be accomplished through the uid they share.  This was done in mergeAndStyleResponse.xsl using the xsl:key syntax.
  • Even with the items above handled, there were still a few snippets not appearing.  Some research revealed Solr only analyzes the number of characters specified by the URL parameter hl.maxAnalyzedChars when looking for a snippet.  By setting this value to 350,000, to cover the largest indexed document, the problem was solved.  I have no idea what the impact to performance of having such a large value will be, but that’s a problem which can be dealt with down the road.
  • By default, Solr passes the snippet back with output escaping enabled.  This means the ‘<’ (less than) and ‘>’ (greater than) symbols required by the <em></em> tags surrounding the hit are entitized.  In other words, they show up as the strings &lt; and &gt;.  This problem is easily solved by using the disable-output-escaping='yes' syntax when transforming the XML into renderable HTML.  But the default tags weren’t what I wanted anyway; I wanted the hit to be bold, not italicized.  So, I also added the parameters &hl.simple.pre=%3CB%3E&hl.simple.post=%3C%2FB%3E, which are the URL-encoded versions of the <B> and </B> tags.
  • Right now, the inputSearchCriteria.php form simply captures the user’s query information in the keywordTxa text box, and passes it right through to Solr unaltered and unedited.  There’s a problem with that, though: if the user puts in multiple keywords, there are spaces between those keywords, which the application will attempt to pass in the URL.  That doesn’t work, of course, but fortunately, the PHP function urlencode converts the text into a URL-compatible version.
  • The last minor glitch was the seeming inability of Safari to display some characters, and its propensity to substitute the little-black-diamond-with-a-question-mark-in-it.  The very ugly fix for this was to use the PHP function str_replace when echoing out the transformed XML.  Again, that’s something which likely needs to be revisited in the future, but it will do for now.
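
Pulling the parameters above together, here is a minimal sketch of how the highlighting request URL might be assembled.  It is written as shell functions rather than PHP.  The host, port and /solr/select path assume a default local Solr install, txt is the field name used by wrap.sh when the documents were indexed, and this urlencode is a hand-rolled, ASCII-only stand-in for the PHP function of the same name (it emits %20 for a space where PHP emits +; either form is acceptable in a query string).

```shell
#!/bin/sh
# Percent-encode a string for use in a URL query (ASCII input only).
urlencode() {
    s=$1
    out=''
    while [ -n "$s" ]; do
        rest=${s#?}        # everything after the first character
        c=${s%"$rest"}     # the first character
        case $c in
            [A-Za-z0-9._~-]) out="$out$c" ;;
            *) out="$out$(printf '%%%02X' "'$c")" ;;
        esac
        s=$rest
    done
    printf '%s' "$out"
}

# Build a select URL with hit highlighting switched on.
# $1 = base URL of the select handler, $2 = raw user query
solr_highlight_url() {
    base=$1
    query=$2
    printf '%s?q=%s&hl=true&hl.fl=txt&hl.maxAnalyzedChars=350000&hl.simple.pre=%s&hl.simple.post=%s\n' \
        "$base" "$(urlencode "$query")" "$(urlencode '<B>')" "$(urlencode '</B>')"
}

solr_highlight_url 'http://localhost:8983/solr/select' 'husky energy'
```

Keeping the encoding in one helper means the bold tags and the user’s keywords go through exactly the same escaping.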

Code Shavings  It may be in the documentation somewhere, but I discovered that reindexing the entire database effectively doubles the size of the index data on disk.  Obviously, it’s retaining both the old version of the index records and adding the new ones.  I would expect there to be a compaction utility or some such thing, but have not yet discovered what that is.
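
One candidate for the compaction step is Solr’s optimize update message.  Posting <optimize/> to the update handler should merge the index segments and drop documents that were superseded during reindexing, but whether it fully reclaims the doubled disk space in this case is untested.  The sketch below only prints the curl command (the function name is made up, and the URL is the default local update handler); pipe its output to sh to actually send the request.

```shell
#!/bin/sh
# Print (do not run) a curl command that posts <optimize/> to a Solr update URL.
# $1 = URL of the update handler
solr_optimize_cmd() {
    printf "curl -s %s --data-binary '<optimize/>' -H 'Content-type:text/xml; charset=utf-8'\n" "$1"
}

solr_optimize_cmd 'http://localhost:8983/solr/update'
```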

*Note the new usage of case for Solr.  The new Solr logo which has been actively discussed on the Solr forum of late made particular note that Solr really has lost its original acronym context, and therefore should simply be referred to as a proper noun.

Posted on 22nd December 2008
Under: Developers' Journal | 6 Comments »

Integration of Completion Reports With The Solr/Lucene Index

The previous post implied the beta release of the search application was imminent.  However, it’s been a struggle to convince Intellog’s ISP to configure DNS so data.E.intellog.com maps to S3, and app.E.intellog.com maps to EC2*, so this will all look like one integrated website.   While that’s working through the system, the search of ST1/ST49 reports was made available to a very limited audience; the members of the mailing list Intellog - OptIn**.  Although availability may be spotty from time-to-time, go to http://www.intellog.com/roundabout/search, if you would like to give it a try.  It accepts standard Lucene query syntax, but the quick version of the latter is to simply type in the name of a company that has either licensed or drilled a well in Alberta anytime in the last eight years.  Then click the Search button at the bottom right, and you will get a list of the reports containing that name.  Click on the hyperlinks on the right to view the reports themselves.

With that in hand, base.Document was updated with the last couple of weeks’ worth of ST1/ST49s, using the ERCB.putDocument stored procedure, which has to be executed once for each date’s worth of documents.  Then, the stored procedure ERCB.getWrapShExecutableTxt was created and used to generate one iteration of wrap.sh for each of the ST1/ST49s.  The resulting script was moved up onto the running EC2 instance that is currently hosting the search application.  Once the script had been executed and the 5663 XML files had been generated, they were downloaded to the laptop, with the intention of having a complete set of these Solr-compatible XMLs.  This is so reindexing exercises can be conducted offline.

More-or-less at the same time, a number of PDF-formatted well completion reports from Manitoba were converted to searchable-PDF format.  The resulting text was extracted and processed into Solr-compatible XML and four of them indexed with the instance of Solr/Lucene running on the development laptop.  The only hitch encountered was truncation of the very long texts found in some of the longer completion report PDFs.  solrconfig.xml (the Solr configuration XML) contains the tag maxFieldLength (twice, actually) which has a default value of 10000.  By increasing the value of this tag to 50000, even the largest of the completion reports indexed correctly and completely.  An open question is the downside of increasing this value.  Is there a performance impact?  An impact on the size of the index file?
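
A minimal sketch of making that change from the command line, assuming the setting appears in solrconfig.xml as <maxFieldLength>N</maxFieldLength> elements; the function name is made up for illustration.  It rewrites every occurrence, which covers both places the tag appears, and writes the result to standard output rather than editing the file in place.

```shell
#!/bin/sh
# Usage: set_max_field_length FILE VALUE
# Prints FILE with every <maxFieldLength> value replaced by VALUE.
set_max_field_length() {
    sed -e "s|<maxFieldLength>[0-9]*</maxFieldLength>|<maxFieldLength>$2</maxFieldLength>|g" "$1"
}

# Example (file names are placeholders):
#   set_max_field_length solrconfig.xml 50000 > solrconfig.new.xml
```

Writing to a new file keeps the original configuration around in case the larger value turns out to have a downside.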

The E.getLibraryXml stored procedure had to be updated so that it generated tags relevant to completion reports, in addition to the ST1/ST49s.  The tags agencyLbl (STEM well license number), sizeAmt (size of the PDF), wellLbl (well’s unique identifier) and wellNm (verbose well name) were added to those Document entities related to completion reports.  In the case of the dt tag, it’s populated with the report issue date in the case of ST1/ST49, and with the rig release date in the case of completion reports.  In addition, sourceLbl can now contain ERCB or STEM, depending on who supplied the document in the first place, and typeLbl can now contain CMP (completion report) in addition to the existing ST1 and ST49 values.  One other minor detail — the Z was dropped from the dt tag, to imply that the date/times shown are local, as per ISO 8601.

In any event, there is now a Solr/Lucene index — on the development laptop, at least — which contains references to both ST1/ST49 documents and well completion reports.  It was fairly obvious, fairly quickly, the format of the result output was heavily biased toward the ST1/ST49, and the format had to be modified to better reflect the shared attributes of the two classes of documents.  In particular the ‘description’ column has two distinctly different types of content depending on the report type.  In the case of the completion report, it logically would contain the well location and name.  Rather than leave this column blank for ST1/ST49 reports, or put in the same lame, repetitive description, it would make sense to include a snippet of the ST1/ST49 that resulted in the document being returned in the result set.  But the details of that will be contained in the next post! 

Thank you for reading, and as always, if you have any questions or comments, please do not hesitate to leave them below.

Code Shavings  The ongoing quest to make my Windows laptop look and act like a Linux machine continued with the download of a Windows version of the cURL utility, described previously in Code Shavings.  Once downloaded, I simply extracted the files, and then moved the resulting directory to C:\Program Files\cURL, and then added the latter folder to my PATH.

*A more complete description of the changes to the taxonomy of intellog.com is the subject of a future post.  Stay tuned.

**Please Contact Us, if you would like to be added to this mailing list.

Posted on 17th December 2008
Under: Developers' Journal | 2 Comments »

Preparation for the Beta Launch of Roundabout Search

It’s a bit hard to tell, but finally, the pieces are coming together to launch* the first chunk of functionality for Roundabout: to wit, the ability to search the full text of every ST1 and ST49 issued by the ERCB in the last eight years.  The last significant step is to organize the development and production environments so the short-but-growing list of files which make up the application can be efficiently deployed.  First, here are the steps required to get the production machine up and running:

  • An instance of the EC2 image intellogV4 (ami-6246a20b) — as previously described in excruciating detail — is started.  This results in a standard Fedora 8 environment with SOLR, Lucene and supporting components installed and ready to go. 
  • schema.xml and solrconfig.xml should already be configured, and should only require an update if they have changed on the development machine.  If they have, they will likely have been incorporated into a new image, so the steps described herein should not change, other than to use the ID of the new instance.
  • The EBS volume containing the SOLR index is attached to the Linux device /dev/sdh with Elasticfox.
  • The attached volume is mounted with the commands mkdir /mnt/data-store (only required if data-store does not already exist), and then mount /dev/sdh /mnt/data-store.  This makes the EBS volume visible to the EC2 instance.
  • For the time being, Apache web services are started manually, with the command /etc/rc.d/init.d/httpd start.  (stop or restart can also be substituted for start to — er — stop or restart Apache services, funnily enough).
  • The command java -jar /usr/jetty/start.jar & starts SOLR up, and gets it listening to its standard port.  I’m not 100% sure the & sign, used to start the process in background is correct, but it seems to work, for now, at least.
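
The start-up steps above could be collected into a single script along the following lines.  This is only a sketch: the run helper, the DRY_RUN switch, the use of nohup and the log file location are all additions for illustration and are not part of the steps as originally performed.  By default it prints the commands instead of executing them; set DRY_RUN=0 and run it as root on the instance to execute them.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

start_production() {
    run mkdir -p /mnt/data-store          # -p: no error if the directory already exists
    run mount /dev/sdh /mnt/data-store    # make the EBS volume visible to the instance
    run /etc/rc.d/init.d/httpd start      # start Apache
    # The trailing & runs SOLR in the background; nohup (an assumption, not in
    # the original steps) keeps it running after the terminal session ends.
    solr_cmd='nohup java -jar /usr/jetty/start.jar >/tmp/solr.log 2>&1 &'
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$solr_cmd"
    else
        sh -c "$solr_cmd"
    fi
}

start_production
```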

With the exception of remapping the reserved Elastic IP to the instance, the start up is complete.  However, it’s best to wait to remap until all the application work is done, as described below.  Or if an instance is currently running behind the Elastic IP, leave it there ’til the new instance is ready.  At that time, the remap to the new instance can be accomplished very quickly, minimizing the service interruption.  Furthermore, it’s very likely the steps above can be further automated and streamlined, but they’re already pretty quick to run, and therefore sufficient for now.

Next, here are the steps to bundle, transfer and install the application on the production server.  They assume a parallel directory structure exists on the development machine, containing only those files that make up the application.  In my case, I created a separate ‘root’ directory named after the subdomain (app.E.intellog.com), and then only moved the files to it that needed to go to the production machine.  It’s also assumed development is occurring on a Windows machine:

  • From the Windows command line on the development machine, the current directory is switched to the DocumentRoot.  Then, dir /s /b /on /a-d >Roundabout/manifest.txt  creates a text file with a recursive list of all files in the DocumentRoot and below.
  • Unfortunately, the command above produces a text file with full paths, so it has to be edited to remove the leading part of each line, up to and including the DocumentRoot.  I’m sure there must be a better way, but this works, for now.
  • The files identified above are compressed and bundled with the 7-Zip command 7z a -ttar Roundabout.tar @Roundabout/manifest.txt.  This creates a Linux-compatible tar file with the files found in manifest.txt, as described immediately above.
  • Roundabout.tar is moved to the DocumentRoot of the soon-to-be-production EC2 instance started above.  As mentioned a number of times before, FileZilla is my weapon of choice in this regard. 
  • On the production server, switch to the DocumentRoot, and then run tar -xf Roundabout.tar to unbundle the application, and automatically place the files in their correct subdirectories.  tar will create the subdirectories, if necessary.
  • Test the search application at http://www.intellog.com/roundabout/search.  Note that in the case of the previous URL, mod_rewrite is being used to turn the somewhat obscure (but logical) URL into something a bit more memorable.  Also note this is a pre-release announcement, so the application will be up and down ’til further notice.
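
For the manual edit of manifest.txt described above, one possible shortcut is sketched here.  It assumes a Unix-style shell with awk is available on the Windows machine (Cygwin, for instance), and the function name and sample path are made up for illustration.  It removes the DocumentRoot prefix, and the separator that follows it, from each line, leaving paths relative to the DocumentRoot.

```shell
#!/bin/sh
# Usage: strip_root ROOT < full-paths.txt > relative-paths.txt
# ROOT is handed to awk through the environment so backslashes stay literal.
strip_root() {
    ROOT="$1" awk '
        BEGIN { root = ENVIRON["ROOT"]; n = length(root) }
        index($0, root) == 1 { print substr($0, n + 2); next }  # drop ROOT plus one separator
        { print }                                                # other lines pass through
    '
}

# Example (hypothetical path):
#   strip_root 'C:\www\app.E.intellog.com' < full-paths.txt > Roundabout/manifest.txt
```

Counting characters instead of using a regular expression avoids having to escape the backslashes in Windows paths.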

That’s it!  Now is the time to remap the instance to the reserved Elastic IP, and the world should be able to see the results of all this hard work.  If you have any questions or comments, please provide them below, or alternatively, Contact Us.

*Well, in beta at least.  But then again Gmail has been in beta since its launch over four years ago, so I really don’t know what ‘beta’ means any more.

Posted on 12th December 2008
Under: Developers' Journal | 2 Comments »

More EC2 Configuration

This post could just as easily have been entitled "Just when I thought I was out, they pull me back in!"*  I thought I was finished with EC2 configuration for the time being, but I wasn’t.  Turns out Intellog’s hosting service does not support PHP XSLT functionality, making it effectively impossible to deploy the heavily-XSLT-dependent Intellog applications to one of their servers.  To get around this problem, it seemed like it should be possible to take the same approach with XSLT as I had with SOLR/Lucene.  In other words, wrap the XSLT logic in a servlet.  It’s not REST, exactly, but it’s effectively the same thing: it provides the ability to access the transformation logic through a simple HTTP request.  After some research, it turns out this way of doing things is fairly commonplace, and is covered on the Saxonica website.

But once I was down that particular technological road, it actually made more sense to have inputSearchCriteria.php, outputSearchResult.php, SOLR, XSLT and all the supporting code on the same instance of EC2.  Seamless integration with the rest of the Intellog website could then be handled through URL rewriting or some similar technique.  At the same time, I realized the AMI I had previously chosen for SOLR on EC2 was out of date.  The tactical goal of the new work, therefore, was to use a more up-to-date AMI — with Fedora 8 — then install SOLR/Lucene, XSLT and the application code to support the inputSearchCriteria.php and outputSearchResult.php pages.  This would be an effective solution in the short term, and because of its implementation on EC2, highly scalable in the future by increasing the capacity of each instance, and potentially adding multiple instances.

With this approach in mind, AWS AMI ami-2b5fba42 was chosen for its basic Fedora 8 image.  yum was then used to install httpd.i386 (ie. Apache), php.i386 and php-xml.i386 packages with the following commands;

yum install httpd.i386
yum install php.i386
yum install php-xml.i386

Following that, it was only necessary to install the Java Runtime (jre-6u10-linux-i586-rpm.bin), and then SOLR (apache-solr-1.3.0.tgz), in precisely the same way they were installed previously.  The EBS volume was mounted on the new instance, and solrconfig.xml and schema.xml were copied up from the development laptop.  When this was all up and running, yet another custom image was created and saved as intellogV4, and registered as ami-6246a20b.

Finally, the Intellog application files related to the search were migrated to the EC2 instance.  This was done one file at a time, so only the files that were absolutely necessary were moved over, with the intention of keeping the EC2 instance as clean as possible.  But at the end of it all, the Intellog search application was up and running on EC2, ready for integration with the rest of the Intellog website.

Code Shavings  Installation of the Java runtime is a single step — it’s only necessary to sh jre-6u11-linux-i586-rpm.bin, and Java is installed and configured correctly.  ♦  When the EBS volume was mounted, along with the index for all the ST1/ST49s, it was necessary to clear the cache on Firefox before it would recognize the restart, and the change of index.

*As read by Silvio Dante, imitating the aging Michael Corleone.

Posted on 9th December 2008
Under: Developers' Journal | 1 Comment »

Well Completion Reports are Public Domain

During a recent presentation to a Calgary-based well service company, the audience expressed a concern as to whether it was within their purview to publish details of their customers’ treatment information.  Unlike other aspects of the industry where there are government-mandated filings and disclosure, much of what well service companies provide does not require a government filing, and is therefore not in the public record.

However, every well drilled in the Western Sedimentary Basin requires a Well Completion Report to be filed with the respective governing agency.  These reports (with a very small number of specific exceptions) are in the public domain.  Furthermore, they provide detailed information of the drilling operations, and include a significant amount of detail with respect to service companies’ contributions to these projects. 

To establish the content and accessibility of these reports, four wells from the four western Canadian provinces were selected at random.  For each well, a request was made to the respective governing agency with jurisdiction.  In all cases, the agencies responded with paper, faxed or in some cases PDF versions of the report.  The information below presents what was received.  The reports can be viewed by clicking on the PDF links below:

Province      Location        Name                           Report
BC            b-59-C/94-H-12  Tusk et al Conroy              PDF (8.4 MB)
Alberta       03-28-060-19W5  Celtic KaybobS 3-28-60-19-W5   PDF (2.2 MB)
Saskatchewan  16-20-023-01W3  Panterra Eyebrow Lake 16-20    PDF (1.6 MB)
Manitoba      15-12-008-29W1  Tundra Sinclair 15-12-8-29     PDF (1.9 MB)

Each of the reports provides significant detail with respect to the completion.  In the case of BC, Alberta and Manitoba, they  contain specific information about the frac treatment performed as part of the well completion, including pumping pressures, proppants, fluids and other specific details.  The Saskatchewan report does not include any details of the frac treatment known to have occurred later in the same year.  This is likely because it was an operation entirely separate from the original well completion.

The format, location and organization of these reports effectively puts them out of reach of a majority of their potential audience, and furthers the perception of their confidentiality.  However, this exercise provides anecdotal proof that much of the detail service companies may not otherwise publish is already in the public domain, if a given individual has the time and is willing to make the effort to obtain it.

Posted on 5th December 2008
Under: Business Development | No Comments »

Bolting SOLR/Lucene Into Roundabout Search

It’s one month ago, less a day, since I posted More SimpleDB Explorer Experience and the First ‘Real’ Search, which described the first keyword search of ST1/ST49s using an index implementation based on SimpleDB.  It was shortly after when Mocky Habeeb suggested (very politely, I must add) that the cure for what ailed me likely already existed in Lucene and SOLR.  Well, after taking the comment to heart, and wrestling with the ages-old reuse-or-build-from-scratch question, here I am in the final stages of bolting Lucene/SOLR functionality into the search within Intellog’s Roundabout application.  A month seems like a long time to invest, but compared to the functionality gained, and the amount of code I don’t have to maintain, it’s like George said to Jerry: "Are you crazy?  This is like discovering plutonium…by accident!"

Connecting the Lucene/SOLR functionality into the search application turned out to be much easier than expected.  It really boiled down to acting on an XML result set coming back from SOLR, as opposed to SimpleDB.  In turn, this meant modifying exactly one XSL, called mergeAndStyleResponse.xsl.  The JOIN functionality, described previously, remains intact.  All I expect SOLR to return, for now, is/are the uid(s) of the document(s) resulting from the search criteria provided.  The rest of the information about the document is contained in a separate XML called Library.xml, which is generated by the SQL Server stored procedure E.getLibraryXml.  Slight modifications were also required to inputSearchCriteria.php and outputSearchResult.php, but nothing significant.  All told, in an hour or two, I was inputting Lucene-compatible search syntax, and getting results back from the previously-created index of the full text of the ST1/ST49.

Posted on 4th December 2008
Under: Developers' Journal | 4 Comments »

Setting Up SOLR on EC2, Part V

The cliffhanger in Part IV was the rewrite of wrap.sh so "[i]t will…retrieve the file-to-be-indexed with curl, put it into a temporary location while it’s fed to SOLR, and then move onto the next file."  Well, that happened, sort of.  In the end, I decided to divide the process into two steps.  First, the new wrap.sh simply retrieves the text from the URL passed as the second parameter, wraps it in the appropriate XML, and then writes it to the local directory under the name of the first parameter (ie. <uid>.xml).  As a result, there were 5600+ XML files after this step completed, one for each document to be indexed.  Then, in the second step, index.sh hands each uniquely named file to the SOLR update URL.  Both wrap.sh and index.sh were fed the list of parameters using awk.  The net result was 5633 documents to index, and when complete, they were all there; the count matched exactly.  And search results were absolutely stellar.
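
As a rough sketch of the awk step, suppose the uid/URL pairs sit in a whitespace-separated text stream, one pair per line.  The function name, the sample URL and the exact argument layout of index.sh are assumptions, since none of them is shown here.  awk can then emit one wrap.sh call and one index.sh call per document:

```shell
#!/bin/sh
# Read "uid url" pairs on stdin; print the commands for the two-step process.
emit_commands() {
    awk 'NF == 2 {
        printf "./wrap.sh %s \"%s\"\n", $1, $2    # step 1: writes <uid>.xml
        printf "./index.sh %s.xml\n", $1          # step 2: posts <uid>.xml to SOLR
    }'
}

printf '%s\n' 'doc0001 http://example.com/st1/0001.txt' | emit_commands
```

The output can be reviewed first and then piped to sh, which keeps the generate-then-execute pattern of the original wrap.sh.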

The final step of the configuration process, at least for now, will be to assign a semi-permanent IP address to an instance of the server.  This is what Amazon refers to as an Elastic IP Address.  They’re very easy to set up — all you have to do is follow Tutorial #4 in the Elasticfox Getting Started Guide.  One thing to keep in mind is the charge for an Elastic IP will seem a little odd.  You are charged $0.01 per hour whenever it’s not associated with an instance, and nothing additional when it is.  Seems backwards, but it’s likely Amazon is simply discouraging IP squatting.  Also, there are very few IP addresses available to standard AWS accounts — just five, at last count, unless you go through an application process and a suitable hazing, presumably.

With the creation of a third version of the image, as described in Part I, the configuration of SOLR for EC2 is complete, at least for the time being.  It obviously could use a ton of refinement, but its current state is more than adequate to address the specific application requirements I have right now.  If you’re interested in getting a hold of the image so you can try SOLR on EC2 yourself, by all means, leave a comment below, and I’ll get back to you. 

On to the integration of SOLR with the Intellog website so that the user community can access the ST1/ST49 reports through the index.

Code Shavings  As noted in Part II, the SOLR index information was relocated to an EBS volume, and solrconfig.xml modified accordingly.  Interestingly enough, you can subsequently forget to mount the EBS volume and SOLR still starts, seemingly without complaint.  Once you realize there’s nothing in the index, and mount the EBS volume, you still have to remember to restart SOLR in order to see the index contents.  ♦  The index for the 5633 documents is about 35 MB compared to 148 MB for the source data, for a ratio of about 23%.  ♦  Restarting an instance unmounts the EBS volume(s), and I’m still working on how to automatically mount them when starting an instance.

Posted on 3rd December 2008
Under: Developers' Journal | No Comments »

Setting up SOLR on EC2, Part IV

In Part III, the major ‘breakthrough’ (in understanding, at least) was to come around to the idea that everything sent to SOLR is wrapped in XML, including the text to be indexed.  The next part of the exercise, therefore, was to wrap the plain text of the ST1/ST49 reports in XML, and use the resulting package to create the SOLR index. 

The SQL Server* table base.Document was, is, and will continue to be used to map documents to their respective immutable uids.  In this exercise, it was used to generate a subset of ST1/ST49 URL/uid pairs; the URL pointing to the physical file to be indexed, and the uid uniquely identifying this document in the known universe.  These pairs were, in turn, fed to the newly-created shell script wrap.sh, as shown;

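# wrap.sh: $1 is the document uid, $2 is the URL of the plain-text report.
# Emit the start of a curl command that posts an <add> document to SOLR.
echo 'curl -s http://localhost:8983/solr/update/ --data-binary '\''<add><doc><field name="uid">'$1'</field><field name="txt">'
# Fetch the report and escape XML-special characters (the ampersand must go first).
curl -s "$2" | sed -e 's/\&/\&amp;/g' | sed -e 's/</\&lt;/g' | sed -e 's/>/\&gt;/g' | sed -e 's/"/\&#34;/g'  | sed -e 's/'\''/\&#39;/g'
# Close the <add> document, then emit a second curl command that commits.
echo '</field></doc></add>'\'' -H '\''Content-type:text/xml; charset=utf-8'\'
echo 'curl -s http://localhost:8983/solr/update/ --data-binary '\''<commit/>'\'' -H '\''Content-type:text/xml; charset=utf-8'\'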

wrap.sh generates executable shell script text with echo commands, which can be piped directly to the command line, or stored in a file and executed as a batch.  The line with the sed statements takes the text of the file retrieved by the curl statement, and substitutes XML-friendly text for XML-unfriendly characters such as ‘<’ (less than) and ‘>’ (greater than).  The last line sends the commit command to SOLR, which may or may not be necessary, but it doesn’t hurt either.
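
The escaping step on its own can be sketched as a small function.  This version folds the five substitutions into a single sed call; it is an equivalent rewrite for illustration (the name xml_escape is made up), not a change to wrap.sh itself.  The ampersand has to be handled first, otherwise the ampersands introduced by the later substitutions would be escaped a second time.

```shell
#!/bin/sh
# Escape the five XML-special characters in text read from stdin.
xml_escape() {
    sed -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/"/\&#34;/g' \
        -e "s/'/\&#39;/g"
}

printf '%s\n' 'Tom & "Jerry" <cat>' | xml_escape
```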

The first group of test files, 1000 in number, produced 989 documents in the index, which was pretty close to what I was expecting.  But not close enough to ignore the fact 11 documents had not made it into the index for some reason.  Closer examination revealed the longer ST1/ST49 documents were causing the text passed to the Linux command line interpreter to exceed the maximum allowable length.

Therefore, some way ’round the limitation had to be found, which unfortunately, put me back where I was last time — looking for a way to point to a text file, and have that form the basis of the index.  Eventually, I found the following syntax;

curl http://localhost:8983/solr/update/ --data-binary @myXml.xml -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/update/ --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'

where myXml.xml contains the XML-wrapped, plain-text from the ST1/ST49 to be indexed.  This implies modifying wrap.sh somewhat.  It will have to retrieve the file-to-be-indexed with curl, put it into a temporary location while it’s fed to SOLR, and  then move onto the next file.  There may even be a better way of doing it, but that will do, for now.

Code Shavings  Although I didn’t use it the first time ’round, xargs would be a great way to pass the list of parameter pairs taken from the base.Document table and feed them to wrap.sh.  ♦  Testing of the finished index demonstrated that search performance is going to be spectacular, and highly functional.  However, the first attempts at using wildcarding didn’t work for some reason.
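
For what it’s worth, the xargs idea might look something like the sketch below, assuming the pairs are whitespace-separated with one uid and one URL per line.  The function name and sample URLs are made up, and echo stands in front of ./wrap.sh so the example can be run anywhere.  The -n 2 option tells xargs to pass two arguments to each invocation.

```shell
#!/bin/sh
# Feed "uid url" pairs from stdin to wrap.sh, two arguments per invocation.
feed_pairs() {
    xargs -n 2 echo ./wrap.sh    # drop "echo" to run wrap.sh for real
}

printf '%s\n' 'doc0001 http://example.com/0001.txt' 'doc0002 http://example.com/0002.txt' | feed_pairs
```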

*Soon to be either MySQL or SimpleDB, but that’s another story for another day.

Posted on 2nd December 2008
Under: Developers' Journal | 1 Comment »