Implementing Hit Highlighting with Solr*
At the conclusion of the previous post, I identified the next technical objective: populating the description column of the result set with information that makes sense for the type of document shown in each row. In the case of completion reports, it makes sense to show the well location and its verbose name. It’s a little more problematic in the case of the ST1/ST49, however. After some consideration, I decided that a snippet of the report surrounding the search term(s) would be useful. This will enable users to quickly scan down the result set and determine whether the full report is worth a closer look. In Solr parlance, this is referred to as hit highlighting, or as I refer to it from time to time below, snippets. Getting this implemented was a matter of incrementally solving a series of small problems:
- The Solr administrative interface lets you test hit highlighting by simply ticking a checkbox, but when I did this (whoops) no snippet appeared. For highlighting to work, the snippet has to come from somewhere. The entire text of each report is being indexed, so what gives? It turns out, of course, that there’s a huge difference between parsing and indexing the entire document and actually being able to retrieve the original text. To do the latter, you have to remember to set the stored attribute of the field element in schema.xml to true. It seems odd to have to store the entire document twice, but disk space is cheap, and performance doesn’t seem to suffer. That done, the snippets began to appear pretty much as expected.
- When highlighting is enabled, Solr simply adds a new lst element to the response XML, identified by the name highlighting. It repeats the unique identifiers found in the original result element and associates each with its snippet, enclosed in a str element. It would have made more sense to me to embed the str element within result, but merging the str element with the data found in result/doc can be accomplished through the uid they share. This was done in mergeAndStyleResponse.xsl using the xsl:key syntax.
- Even with the items above handled, a few snippets were still not appearing. Some research revealed that Solr only looks for snippets within the number of characters specified by the URL parameter hl.maxAnalyzedChars. Setting this value to 350,000, enough to cover the largest indexed document, solved the problem. I have no idea what impact such a large value will have on performance, but that’s a problem which can be dealt with down the road.
- By default, Solr passes the snippet back with output escaping enabled. This means the ‘<’ (less than) and ‘>’ (greater than) symbols required by the <em></em> tags surrounding the hit are entitized. In other words, they show up as &lt; and &gt; strings. This problem is easily solved by using the disable-output-escaping='yes' syntax when transforming the XML into renderable HTML. But the default tags weren’t what I wanted anyway: I wanted the hit to be bold, not italicized. So I also added the parameters &hl.simple.pre=%3CB%3E&hl.simple.post=%3C%2FB%3E, which are URL-encoded versions of the <B> and </B> tags.
- Right now, the inputSeachCriteria.php form simply captures the user’s query in the keywordTxa text box and passes it straight through to Solr, unaltered and unedited. There’s a problem with that, though: if the user enters multiple keywords, there are spaces between those keywords, which the application will attempt to pass in the URL. That doesn’t work, of course, but fortunately the PHP function urlencode automatically converts the text into a URL-compatible version.
- The last minor glitch was the seeming inability of Safari to display some characters, and its propensity to substitute the little-black-diamond-with-a-question-mark-in-it. The very ugly fix for this was to use the PHP function str_replace when echoing out the transformed XML. Again, that’s something which likely needs to be revisited in the future, but it will do for now.
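To make the request and response handling in the list above a little more concrete, here is a minimal sketch in Python rather than the PHP and XSLT the application actually uses. It is an illustration under assumptions, not the application’s code: the function names, the base URL, the snippet field name (text) and the unique key field name (uid) are all hypothetical. The first function builds a highlighting query with the parameters discussed above, letting the library handle the URL encoding; the second pairs each result document with its snippet through the shared identifier, which is the same join the stylesheet performs with xsl:key.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_highlight_url(base_url, keywords, snippet_field, max_chars=350000):
    """Build a Solr select URL with hit highlighting switched on.

    base_url and snippet_field are assumptions for this sketch; the
    hl.* names are standard Solr highlighting parameters.
    """
    params = {
        "q": keywords,                        # raw user input, spaces and all
        "hl": "true",                         # enable highlighting
        "hl.fl": snippet_field,               # field to take snippets from
        "hl.maxAnalyzedChars": str(max_chars),
        "hl.simple.pre": "<B>",               # bold instead of the default <em>
        "hl.simple.post": "</B>",
    }
    # urlencode escapes spaces and angle brackets so the URL stays valid.
    return base_url + "?" + urlencode(params)


def merge_snippets(response_xml, id_field="uid"):
    """Pair each result doc with its snippet via the shared unique id.

    id_field is a hypothetical unique key name; adjust it to the schema.
    """
    root = ET.fromstring(response_xml)

    # Index the highlighting section by document id.
    snippets = {}
    for entry in root.findall("./lst[@name='highlighting']/lst"):
        first = entry.find(".//str")          # first snippet, if any
        if first is not None:
            snippets[entry.get("name")] = first.text or ""

    # Walk the result docs and look up each one's snippet.
    rows = []
    for doc in root.findall("./result/doc"):
        uid = doc.findtext(f"./str[@name='{id_field}']")
        rows.append({"uid": uid, "snippet": snippets.get(uid, "")})
    return rows
```

Documents with no highlighting entry simply get an empty snippet here, which mirrors the situation described above where some rows came back without one.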
Code Shavings
It may be in the documentation somewhere, but I discovered that reindexing the entire database effectively doubles the size of the index data on disk. Obviously, it’s retaining both the old version of the index records and adding the new ones. I would expect there to be a compaction utility or some such thing, but have not yet discovered what that is.
*Note the new usage of case for Solr. The discussion of the new Solr logo, which has been active on the Solr forum of late, made particular note that Solr has really lost its original acronym context and should therefore simply be treated as a proper noun.
Posted on 22nd December 2008
