Slightly Rethinking the Retrieval of URLs
Well, fortunately, not everything. However, as the PageIndex concept was coming together, it became obvious that keeping the URL in a separate file (Library.xml) made little or no sense. wrap.sh was wrapping all of the other metadata (e.g. document source and type, page sequence, etc.), but the URL was being split out to Library.xml. As described previously, when the results of a search were styled with searchResult.xsl, the uid was used as a key to look up the associated URL in Library.xml.
The arguments in favour of changing this approach are as follows: 1) the logic of data retrieval and styling is getting mixed up in searchResult.xsl, when they really are conceptually separate ideas; 2) the performance of the join logic will likely degrade as the number of documents increases and Library.xml gets very large; 3) Solr is doing a great job retrieving the rest of the metadata anyway, so it might as well haul back the URL too; and finally 4) keeping Library.xml in sync with the index doubles up on the maintenance. The main counter-arguments are 1) changing the location of the file referenced by the URL requires the entry to be reindexed in Solr*, and 2) it doesn’t ‘feel’ normalized, in a traditional RDBMS sense. The former is really no big deal, as Solr handles reindexing elegantly and quickly, and the latter is esoteric, at best. But old habits die hard.
Making the change started by adding some metadata (sizeAmt, sourceLbl, typeLbl and Udt) to the instances of base.Document referencing all of the ST1, ST49 and ST96 documents. This metadata was all stored in Document.xml. The procedure with the unwieldy-but-surprisingly-descriptive name ERCB.getReindexStatisticReportExecutableTxt was created to automate the generation of the content of wrapBatch.sh and index.sh for any instance of base.Document referencing any of the statistical ST reports. The most significant capability of getReindexStatisticReportExecutableTxt is that it includes urlTxt along with all the other metadata in its output. wrapBatch.sh then feeds these parameter sets to wrap.sh, which creates a Solr-compatible XML file for each one. index.sh then feeds these XML files to Solr, thereby reindexing the entries. Solr’s schema.xml was edited to add the urlTxt field. A sample of 10 was tested, right through to ensuring urlTxt could be queried out of the Solr index. It all tested out OK, so the reindexing of the ST reports is ready to go. To this end, wrapBatch.sh was run against the entire set of ST reports, generating one XML file per report, 8688 in total.
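To make the wrapping step a little more concrete, here is a minimal Python sketch of what producing a Solr-compatible XML file with urlTxt alongside the other metadata might look like. It is not the actual wrap.sh (which is a shell script and is not shown here). The function name build_solr_add_xml and every sample value are hypothetical; only the field names uid, urlTxt, sizeAmt, sourceLbl and typeLbl come from the description above, and the add/doc/field layout is Solr's standard XML update format.

```python
import xml.etree.ElementTree as ET


def build_solr_add_xml(fields):
    """Return a Solr <add><doc>...</doc></add> message for one document.

    `fields` maps Solr field names to values. This helper is a hypothetical
    stand-in for what wrap.sh does, not a copy of it.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# All values below are made up for illustration.
sample = {
    "uid": "ST1-0001",
    "urlTxt": "http://data.example.com/ST1/0001.txt",
    "sizeAmt": 12345,
    "sourceLbl": "ERCB",
    "typeLbl": "ST1",
}

print(build_solr_add_xml(sample))
```

With urlTxt carried in the document itself, the search result can be styled directly from what Solr returns, with no lookup into Library.xml.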
Code Shavings
♦ A new scalar-valued function, E.getTypeLbl, was developed, which generates a standardized typeLbl from the characteristics of urlTxt.
♦ One of the bigger hassles was bringing in the file size of the statistical (ST) reports. I ended up hacking together a spreadsheet (fileSizeHack.xls), which takes the output of dir ??.txt /s, and massages it so you’re left with the path name and the size of the file. This was then imported into SQL Server as dbo.Filter$ (huh?), and then merged with base.Document.
♦ As noted above, adding the urlTxt to Solr necessitates the reindexing of entries. I figured while I was doing this, I might as well migrate the files from E.intellog.com/data to the new standard data.E.intellog.com, described a while back.
♦ Interesting fun fact: using BucketExplorer to change the Access Control for the objects in data.E.intellog.com took longer than uploading the objects themselves!
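For what it's worth, the file-size massaging done in fileSizeHack.xls could probably also be scripted. The following is a rough Python sketch, not what was actually used. It assumes the English-language layout of Windows dir /s output ("Directory of ..." headers and MM/DD/YYYY dates); the function name parse_dir_listing and the sample listing are made up for illustration.

```python
import re

# Assumed en-US layout of `dir /s` output; other locales will differ.
DIR_HEADER = re.compile(r"^\s*Directory of (?P<path>.+?)\s*$")
FILE_LINE = re.compile(
    r"^\d{2}/\d{2}/\d{4}\s+\d{1,2}:\d{2}(?:\s*[AP]M)?\s+"
    r"(?P<size>[\d,]+)\s+(?P<name>.+?)\s*$"
)


def parse_dir_listing(text):
    """Return a list of (full_path, size_in_bytes) tuples from dir /s output."""
    results = []
    current_dir = None
    for line in text.splitlines():
        header = DIR_HEADER.match(line)
        if header:
            current_dir = header.group("path")
            continue
        entry = FILE_LINE.match(line)
        # <DIR> rows and the "File(s)" summary rows do not match FILE_LINE.
        if entry and current_dir is not None:
            size = int(entry.group("size").replace(",", ""))
            results.append((current_dir + "\\" + entry.group("name"), size))
    return results


# Hypothetical listing, shaped like the output of: dir ??.txt /s
SAMPLE = r"""
 Directory of C:\data\ST1

02/24/2009  10:15 AM    <DIR>          .
02/24/2009  10:15 AM            12,345 ab.txt
               1 File(s)         12,345 bytes

 Directory of C:\data\ST49

02/24/2009  10:16 AM               987 cd.txt
               1 File(s)            987 bytes
"""

for path, size in parse_dir_listing(SAMPLE):
    print(path, size)
```

The resulting path and size pairs are the same two columns the spreadsheet leaves you with, ready to be loaded and merged with base.Document.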
*Currently, Solr does not have the ability to update a single field in its index. The entire entry has to be reindexed. However, updating a single field is a capability which will likely find its way into the application in the future according to solr-user@lucene.apache.org.
Posted on 24th February 2009
Under: Developers' Journal
