Setting Up SOLR on EC2, Part III
Parts I and II of this series covered the initial setup of SOLR on EC2, and configuring EBS to store index information so it would persist beyond the life of the instance itself. This part covers finding a (semi-)permanent home for SOLR on the EC2 instance, and manipulating schema.xml to suit the ST1/ST49 reports which will be Intellog’s initial focus for SOLR on EC2.
The (semi-)permanent home for SOLR was established by copying the directory /root/apache-solr-1.3.0/example to /usr/jetty. The logic supporting this (or lack thereof?) was that user-specific objects are generally found in /usr, and Jetty is the servlet container in which SOLR is delivered. The full path to schema.xml was therefore /usr/jetty/solr/conf/schema.xml, which just looks right to me. This approach might come to grief if multiple instances of SOLR run on a given EC2 instance, but for the moment, I’m thinking one SOLR per EC2 instance.
What confused me, initially, was SOLR’s single-minded focus on XML as the means of interacting with its internal Lucene search engine. Regardless of the fact that XML-as-wrapper-around-Lucene is really SOLR’s entire raison d’être, I found myself looking for the ‘how to’ on uploading a plain text document. The less evident the answer seemed to be (an hour-and-change on Google yielded nothing obvious), the more I thought I must be thinking about things in fundamentally the wrong way. Then it dawned on me, in yet another blinding flash of the obvious: wrap the plain text of the ST1/ST49 in an XML package, and send that to SOLR. Within moments of the ‘insight’, the documents were going into SOLR with little if any trouble at all.
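As a rough sketch of the wrapping idea, the following Python builds the kind of XML envelope SOLR expects around a piece of plain text. The function name wrap_as_add_xml and the sample text are hypothetical, and the field names uid and txt are the ones chosen later in this post; this is an illustration, not the code actually used for the ST1/ST49 reports.

```python
import uuid
import xml.etree.ElementTree as ET


def wrap_as_add_xml(doc_id, text):
    """Wrap plain text in an <add><doc> envelope for SOLR's update handler."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Field names uid and txt are assumed to match schema.xml.
    ET.SubElement(doc, "field", name="uid").text = doc_id
    # ElementTree escapes &, < and > in the text when serializing.
    ET.SubElement(doc, "field", name="txt").text = text
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    sample_id = str(uuid.uuid4())  # placeholder identifier, not a real report id
    print(wrap_as_add_xml(sample_id, "Plain text with <brackets> & ampersands."))
```

Building the envelope with an XML library rather than string concatenation means characters such as < and & in the report text cannot break the document sent to SOLR.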
I had also assumed that schema.xml would only be used to take advantage of SOLR extensibility features, so my first reaction was not to use it at all. After all, you don’t need anything like it to use regular ol’ Lucene, right? However, as described in Part II of this series, schema.xml is integral to the operation of SOLR. There is no latter without the former. In fact, what helped me was to alter my perception of SOLR a little: it’s not just a REST-like wrapper for Lucene, but also a mechanism for adding structure to the data being indexed. While the two are obviously interdependent, they are distinct concepts in my mind.
With that realization, I still liked the idea of starting with a minimal schema.xml similar to the one in Chris Hostetter’s presentation to ApacheCon 2008, and then adding only what was absolutely necessary to index the full text of the ST1/ST49s. I assumed this could be done with just two fields: uid for the unique identifiers previously assigned to the documents to be indexed, and txt for the full text of the related document. There is a gotcha, however. To go with the two fields, I assumed I would need just two field types, also named uid and txt, of the solr.UUIDField and solr.TextField classes respectively. However, when I tried to remove <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/> from schema.xml, the configuration would fail. So in addition to the two fields required for ST1/ST49, it seems obligatory to keep this other field type in the mix as well.
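To make the shape of such a schema concrete, here is a hedged sketch held in a Python string, together with a small check that every field refers to a declared field type. The field and type names come from the description above; the schema name, the analyzer chain for txt, and the indexed/stored/required attributes are assumptions added for illustration, not the schema.xml actually deployed.

```python
import xml.etree.ElementTree as ET

# Assumed minimal schema: two fields, their two types, plus the string type
# that could not be removed without the configuration failing.
MINIMAL_SCHEMA = """<?xml version="1.0" encoding="UTF-8"?>
<schema name="minimal" version="1.1">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="uid" class="solr.UUIDField"/>
    <fieldType name="txt" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="uid" type="uid" indexed="true" stored="true" required="true"/>
    <field name="txt" type="txt" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>uid</uniqueKey>
  <defaultSearchField>txt</defaultSearchField>
</schema>
"""


def undeclared_types(schema_xml):
    """Return field types that are used by a <field> but never declared."""
    root = ET.fromstring(schema_xml.encode("utf-8"))
    declared = {ft.get("name") for ft in root.iter("fieldType")}
    used = {f.get("type") for f in root.iter("field")}
    return sorted(used - declared)


if __name__ == "__main__":
    print(undeclared_types(MINIMAL_SCHEMA))
```

A check like this only catches dangling type references and malformed XML before a restart; it does not replicate SOLR's own validation, so it would not explain why the string type has to stay.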
A few other miscellaneous tips for those (like me) who are new to schema.xml and SOLR in general:
- Each time schema.xml is modified, it’s necessary to restart SOLR. I haven’t yet found a better way of doing this than breaking the session (Ctrl-C) and then re-running the java -jar start.jar command. But it seems like there should be one.
- The ‘minimal’ schema.xml, from Chris’ presentation, says the <uniqueKey>uid</uniqueKey> tag is "a good idea, but not strictly necessary". I tried removing that tag, but the configuration seemed to fail every time. So it may be that it’s mandatory now.
- The update verb is used for both updates and additions to the index. I’m embarrassed to admit how long it took me to figure out there is no add verb; the add is instead embedded in the XML passed to SOLR with the update verb.
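The last tip can be sketched in Python as follows: both the addition and the commit that makes it searchable are POSTed to the same update URL, and only the XML body differs. The URL assumes the default port of the example Jetty setup, and the helper names and the placeholder uid value are hypothetical.

```python
import urllib.request

# Assumption: default update URL of the example Jetty setup; adjust as needed.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_body, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST to SOLR's update handler."""
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def send(request):
    """Send a prepared request and return SOLR's XML response as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# The "add" lives in the XML body, not in the URL.
add_request = build_update_request(
    '<add><doc>'
    '<field name="uid">00000000-0000-0000-0000-000000000001</field>'
    '<field name="txt">plain text of the report</field>'
    '</doc></add>'
)
# A commit is sent to the same update URL with a different body.
commit_request = build_update_request("<commit/>")

# With SOLR running, the two requests would be sent in order:
#   print(send(add_request))
#   print(send(commit_request))
```

Keeping request construction separate from sending makes it easy to inspect exactly what would go over the wire before pointing it at a live index.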
Code Shavings
Field naming is taken directly from the Intellog database naming standards, which are based on recommendations found at butzi.ca/tech. ♦ There is a great, very detailed article by Paul Bramscher on the installation of SOLR on Ubuntu Linux, which provided some excellent background. Thanks, Paul. ♦ I hadn’t used it before, but curl is a great way of interacting with SOLR during the setup process. As described on the man page, "curl is a tool to transfer data from or to a server, using one of the supported protocols (HTTP, HTTPS [etc.]). The command is designed to work without user interaction." In other words, a command-line equivalent/replacement for the browser.
Posted on 28th November 2008


