Archive for November, 2008

Setting Up SOLR on EC2, Part III

Parts I and II of this series covered initial setup of SOLR on EC2, and configuring EBS to store index information, so it would persist beyond the life of the instance itself.  This part covers finding a (semi-)permanent home for SOLR on the EC2 instance, and manipulation of schema.xml to suit the ST1/ST49 reports which will be Intellog’s initial focus for SOLR on EC2.

The (semi-)permanent home for SOLR was established by copying the directory /root/apache-solr-1.3.0/example to /usr/jetty.  The logic supporting this (or lack thereof?) was that user-specific objects are generally found in /usr, and Jetty is the servlet container in which SOLR is delivered.  The full path to schema.xml was therefore /usr/jetty/solr/conf/schema.xml, which just looks right to me.  This approach might come to grief in the event there are multiple instances of SOLR running on a given EC2 instance, but for the moment, I’m thinking one SOLR per EC2 instance.

What confused me, initially, was SOLR’s single-minded focus on XML as the means of interacting with its internal Lucene search engine.  Regardless of the fact XML-as-wrapper-around-Lucene is really SOLR’s entire raison d’etre, I found myself looking for the ‘how to’ on uploading a plain text document.  The less evident the answer seemed to be (an hour-and-change on Google yielded nothing obvious), the more I thought I must be thinking about things in fundamentally the wrong way.  Then it dawned on me — yet another blinding flash of the obvious — wrap the plain text of the ST1/ST49 in an XML package, and send that to SOLR.  Within moments of the ‘insight’, the documents were going into SOLR with little if any trouble at all.
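The wrap-it-in-XML step can be sketched roughly as follows.  This is a minimal Python illustration rather than the code actually used; the field names uid and txt are borrowed from the schema discussed in this post, while the function name, the sample identifier and the sample report text are made up for the example.

```python
import xml.etree.ElementTree as ET


def wrap_for_solr(uid, text):
    """Wrap one plain-text document in SOLR's <add><doc> update XML."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # One <field> per schema field; ElementTree escapes &, < and > in the text.
    ET.SubElement(doc, "field", name="uid").text = uid
    ET.SubElement(doc, "field", name="txt").text = text
    return ET.tostring(add, encoding="unicode")


# Hypothetical identifier and report text, for illustration only.
sample = wrap_for_solr(
    "00000000-0000-0000-0000-000000000001",
    "WELL LICENCES ISSUED: A & B",
)
print(sample)
```

Building the XML with a library, rather than by string concatenation, means any ampersands or angle brackets in the report text get escaped instead of breaking the document.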

I had also assumed that schema.xml would only be used to take advantage of SOLR extensibility features, so my first reaction was not to use it at all; after all, you don’t need anything like it to use regular ol’ Lucene, right?  However, as described in Part II of this series, schema.xml is integral to the operation of SOLR.  There is no latter without the former.  In fact, what helped me was to alter my perception of SOLR a little; it’s not just a REST-like wrapper for Lucene, but also a mechanism for adding structure to the data being indexed.  While they are obviously interdependent, they are two distinct concepts, in my mind.

With that realization, I still liked the idea of starting with a minimal schema.xml similar to the one in Chris Hostetter’s presentation to ApacheCon 2008, and then adding only what was absolutely necessary to index the full text of the ST1/ST49s.  I assumed this could be done with just two fields; uid for the unique identifiers previously assigned to the documents to be indexed, and txt for the full text of the related document.  There is a gotcha, however.  To go with the two fields, I assumed I would need just two field types, also named uid (of the solr.UUIDField class) and txt (of the solr.TextField class).  However, when an attempt was made to remove <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/> from schema.xml, the configuration would fail.  So in addition to the two field types required for ST1/ST49, it seems like it’s obligatory to keep this other field type in the mix as well.

A few other miscellaneous tips for those (like me) who are new to schema.xml and SOLR in general;

  • Each time schema.xml is modified, it’s necessary to restart SOLR. I haven’t yet found a better way of doing this than breaking the session (Ctrl-C), and then re-running the java -jar start.jar command.  But it seems like there should be one.
  • The ‘minimal’ schema.xml, from Chris’ presentation, says the <uniqueKey>uid</uniqueKey> tag is "a good idea, but not strictly necessary". I tried removing that tag, but the configuration seemed to fail every time.  So it may be it’s mandatory, now.
  • The update verb is used for both updates and additions to the index.  I’m embarrassed to admit how long it took me to figure out there is no add verb; instead, the add is embedded in the XML passed to SOLR with the update verb.
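To make the last tip concrete, here is a hedged Python sketch of what an add looks like on the wire.  It only builds the requests and does not send them.  The URL assumes the default port and path of the example Jetty setup running locally, and the helper name and document contents are invented for illustration.

```python
import urllib.request

# Assumed endpoint of a local example install; host and port will vary.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_body, url=UPDATE_URL):
    """Build (but do not send) a POST to SOLR's update handler."""
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


# The "add" lives in the XML body, not in the URL.
add_request = build_update_request(
    '<add><doc>'
    '<field name="uid">00000000-0000-0000-0000-000000000001</field>'
    '<field name="txt">full text of the report goes here</field>'
    '</doc></add>'
)
# Added documents only become searchable after a commit.
commit_request = build_update_request("<commit/>")

# To actually send them: urllib.request.urlopen(add_request), then the commit.
```

Both requests go to the same update URL; only the XML payload distinguishes an add from a commit.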

Code Shavings  Field naming is taken directly from the Intellog database naming standards employed, which are based on recommendations found at butzi.ca/tech.  ♦  There is a great, very detailed article by Paul Bramscher on the Installation of SOLR on Ubuntu Linux, which provided some excellent background.  Thanks, Paul.  ♦  I hadn’t used it before, but curl is a great way of interacting with SOLR during the setup process.  As described on the man page, "curl is a tool to transfer data from or to a server, using one of the supported protocols (HTTP, HTTPS [etc.]). The command is designed to work without user interaction."  In other words, a command-line equivalent/replacement for the browser.

Posted on 28th November 2008
Under: Developers' Journal | 2 Comments »

Setting Up SOLR on EC2, Part II

In Part I, initial SOLR setup was undertaken on an instance of EC2, and an AMI called intellogV2 was created to preserve implementation progress to that point.  However, Amazon themselves describe instance storage as "ephemeral".  Therefore, it’s not well suited to the storage of SOLR index information, assuming the index data is updated more-or-less constantly.  As a result, the next objective was to set up an Elastic Block Store (EBS), which is EC2’s solution for longer-term storage of data.  Or as Amazon describes it, "off-instance storage that persists independently from the life of an instance."  Perfect!

Setting up the first, one gigabyte* EBS volume was accomplished — mostly — by following Tutorial #3 in the Elasticfox Getting Started Guide (GSG), tempered with the following thoughts and comments;

  • In 2.4, there’s reference to "Linux/UNIX instances currently support devices ‘sdf‘ to ‘sdh‘", without a lot of explanation as to which specific device should be used.  So the example shown, /dev/sdh, was used, with a resolution to go back and fill in this knowledge gap, at some point.
  • In 3.2, there is a reference to mk2fs, which does not seem to exist in Fedora.  Instead, there is a utility mke2fs.  However, it’s not actually necessary to use the utility at all.  Simply following "Using an Amazon EBS Volume within an Instance" in the Developer Guide was all that was required.  That done, there was a new /mnt/data-store on the running intellogV2 instance, which is where the SOLR index data will reside.
  • My first shot at mounting the newly-created volume failed because the running instance and the EBS volume were started in different availability zones.  This was easily rectified by restarting the instance in the same availability zone as the EBS volume.  What’s interesting is simply accepting the defaults during the instance start and/or the volume creation returned two different availability zones, so a little bit of attention has to be paid to the setup process.

Configuring SOLR to use the new volume was very straightforward.  The configuration file /root/apache-solr-1.3.0/example/solr/conf/solrconfig.xml contains an XML tag named <dataDir>.  The only change required is modification of the tag to read <dataDir>${solr.data.dir:/mnt/data-store}</dataDir>, where /mnt/data-store is the new EBS volume described above.  When SOLR is started with java -jar start.jar, it discovers there’s no index on the new volume, and automagically creates the necessary structure on the EBS volume.  That’s it — all index activity occurs on the EBS volume.
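The ${solr.data.dir:/mnt/data-store} notation reads as "use the solr.data.dir property if one is set, otherwise fall back to /mnt/data-store".  The following Python sketch mimics that behaviour as I understand it; it is an illustration of the placeholder-with-default idea, not SOLR’s actual implementation, and the function name and the alternative path are invented.

```python
import re

# Matches ${name} or ${name:default}.
_PLACEHOLDER = re.compile(r"\$\{([^}:]+)(?::([^}]*))?\}")


def resolve(value, properties):
    """Replace each ${name:default} with properties[name], else the default."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in properties:
            return properties[name]
        if default is not None:
            return default
        raise KeyError(name)  # no property set and no default supplied

    return _PLACEHOLDER.sub(substitute, value)


data_dir = "${solr.data.dir:/mnt/data-store}"
print(resolve(data_dir, {}))                               # /mnt/data-store
print(resolve(data_dir, {"solr.data.dir": "/tmp/index"}))  # /tmp/index
```

In practice this means the EBS mount point is only the default; starting Java with a different solr.data.dir system property should override it without editing solrconfig.xml again.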

Based on the theory Lucene index data is compatible across platforms, the previously described ST1 and ST49 index created on the local Windows client was simply moved over to /mnt/data-store.  While the SOLR administrative interface reported the correct number of documents (about 5600, or so), nothing was returned in response to test queries.  This seemed to illustrate the role of the heretofore overlooked schema.xml, also found in the configuration directory noted above.  The ST1/ST49 index was created absent any thought of schema.xml; the Lucene defaults were simply accepted.  Furthermore, the SOLR documentation talks about reindexing SOLR whenever schema.xml changes occur.  While contemplating how to accomplish the reindexing step, it struck me it should be possible to simply build the SOLR index right on the EC2 instance, directly from the source documents themselves.  However, this will require some modification to schema.xml to account for the pure text format of the ST1/ST49 reports.

According to the documentation, unmounting the volume can be accomplished by simply terminating the EC2 instance.  However, it’s unknown what impact this would have on an index running an update, so it seems like it’s best to unmount the volume manually prior to shutting the instance down.  At least that will be necessary for the production configuration, if not right now.

Next up: modifying schema.xml as described immediately above, and finding a permanent home for SOLR on the intellogV2 AMI.

*What’s Jeff going to do with the $0.10 (ten cents) per month this is going to generate?!

Posted on 27th November 2008
Under: Developers' Journal | 1 Comment »

Setting Up SOLR on EC2, Part I

It’s intended to use the Lucene search engine, as implemented by SOLR, to provide search functionality on the Intellog website, and furthermore, it’s intended to host it on Amazon’s EC2.  Last time, first impressions of EC2 were described, as were the initial steps in the EC2 setup process.  This post will continue with that, and describe the specific steps used to achieve an EC2-based platform for SOLR.

The initial objective was to start with a standard, AWS-supplied AMI (specifically ami-2e5fba47), configure it, and then create a customized image once SOLR had been installed.  The resulting image could then be used to start one or more EC2 instances, each of which represents a virtual machine.  Not only should the resulting implementation be robust, but at the first signs of search requests slowing down under load, another instance can be launched from the same image, instantly increasing the capacity to process queries.  Hot dang!

Linux  Getting a garden-variety Linux instance running on EC2 is as easy as following Tutorial #1 in the Elasticfox Getting Started Guide (GSG), described in the previous post.  The first clear indication you have successfully launched a Linux instance is there is a new machine available at http://ec2-##-###-###-###.compute-1.amazonaws.com.  It won’t be exactly as shown (hence the not-so-hyperlink) because EC2 will generate the URL automatically when the instance is started, with digits replacing the # symbols.  Further proof of success is the creation/modification of index.html in /var/www/html on the running instance will yield changes to the default home page for that running instance.

Java Runtime Environment and SOLR Package Install  The packages jre-6u10-linux-i586-rpm.bin and apache-solr-1.3.0.tgz were downloaded to the local client, and moved to /root on the newly created instance using a secure session of FileZilla.  Initially, I thought it was necessary to install Tomcat, as well, but it turns out SOLR comes with its own servlet container, Jetty, bundled with the package above.  Installing the packages was aided by the articles How to Install Java on Linux, and How to Install Linux / UNIX *.tar.gz Tarball Files, respectively.  Both packages were installed under /root, although I suspect there will be a more appropriate location chosen at some point in the future.

Initial SOLR Testing  By this point, at least in theory, SOLR should have been able to process queries.  It was, sort of;

  • Executing  java -jar start.jar  while in /root/apache-solr-1.3.0/example should have started SOLR, and feedback to the console tended to indicate same.  However, the URL http://ec2-##-###-###-###.compute-1.amazonaws.com:8983/solr/select/?q=opengl&indent=on, with the intention of finding all documents containing the term opengl, did not return anything, even with sample data in the index.
  • To remedy the above, the All Incoming Security Group was modified to open port 8983, which is the port that SOLR listens on by default.  It’s not necessary to reboot the instance in order for this change to take effect.  I’m not sure this is entirely secure, but it’s acceptable during this configuration and experimentation stage.
  • After opening the port, not only did the search work as expected — and thus proving SOLR was basically operational — but it was also possible to access the SOLR administrative interface from the Windows client without having to install Firefox on the EC2 instance.
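For anyone scripting these test queries rather than typing them into a browser, the select URL can be assembled along these lines.  This is a small Python sketch; the host name is a placeholder for whatever public DNS name EC2 assigns, and the function name is my own.

```python
from urllib.parse import urlencode, urlunsplit


def select_url(host, query, port=8983):
    """Build a SOLR select URL for the given query string."""
    params = urlencode({"q": query, "indent": "on"})
    return urlunsplit(("http", f"{host}:{port}", "/solr/select/", params, ""))


# Placeholder host; substitute the public DNS name of the running instance.
print(select_url("ec2-host.example.com", "opengl"))
# http://ec2-host.example.com:8983/solr/select/?q=opengl&indent=on
```

Letting urlencode build the query string takes care of escaping spaces and punctuation once the queries get more involved than a single word.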

Setup for Creating the Custom Images  Satisfied the instance was basically configured correctly, attention turned to Tutorial #2 in the Elasticfox GSG, which covers the creation of custom images.  Actually, I can save you the trouble.  If you skip ahead to page 20 of the PDF, bundling and saving of Linux images is not really supported by Elasticfox (kaboom!), and has to be done with the API/AMI tools, instead.  In other words, using some combination of command-line syntax in the spirit of MS-DOS.  This isn’t difficult, just tedious.  A few tips, based on the experience of creating a couple of images;

  • EC2 instance images are stored on S3, so if you haven’t signed up for that AWS service, you best do that, and furthermore, create a bucket.  There were several existing buckets I could have chosen to use, but decided instead to dedicate one, new bucket specifically for Intellog EC2 images.
  • You need to install two sets of command-line tools, if you haven’t done so already; API on the client (Windows, in my case) and AMI on the running Linux instance.  In the case of the AWS image that was used as a base (ami-2e5fba47), the AMI tools were already installed.
  • You will need an X.509 certificate for using the command-line tools, so if you haven’t done that, you should go through the procedure to create or upload one.
  • Installing the command-line tools means a fair number of environment variables (eg. EC2_HOME etc) need to be configured.  On a Windows client, use the Control Panel so you can set, and subsequently forget, these variables.  You definitely don’t want to set them by hand every time you want to use the tools.  Also, don’t forget JAVA_HOME needs to be set so it points to the Java Runtime Environment you have installed.

Creating the Custom Images  Basically, this can be accomplished by following the Creating an Image article from AWS, along with the following thoughts;

  • Prior to creating the image, you are required to copy the private key and X.509 certificate files to the /mnt directory on the running instance.  It just doesn’t feel right, to me, copying these two files anywhere other than the client.  That concern notwithstanding, the /mnt directory is automatically excluded from the image when it is created.  Therefore the copied key and certificate files die when you shut the instance down after creating the image.
  • The Bundling step from the article takes a surprisingly long time to complete (seven-to-ten minutes), so it’s important to be patient, and wait until the last of the screen feedback has appeared, and you’ve got the # prompt back.
  • ec2-upload-bundle works pretty much as described in the AWS article noted above, but the syntax is pretty verbose, what with the bucket name, the access keys and the like.
  • My episodic dyslexia kicked in again, and I struggled with the format of the  ec2-register command.  Just keep in mind you only have to specify the base bucket name, followed by the name of the manifest file.  For example, assume your S3 bucket is called empty, and your image is called emptyV1, then the complete command is ec2-register empty/emptyV1.manifest.xml
  • ec2-register is installed on the client side, but not on the instance, at least not by default.  So you will have to run ec2-register from the former, rather than the latter.
  • Also, keep in mind that it’s tempting to save lots of images, but don’t forget there will be S3 storage charges for each image (and they’re fairly large).  S3 storage is pretty cheap, however.
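The ec2-register argument format from the tips above boils down to bucket, slash, image name, then the .manifest.xml suffix.  As a trivial Python sketch (the helper name is made up, and the bucket and image names are the ones from the example above):

```python
def manifest_location(bucket, image_name):
    """Return the bucket-relative manifest path that ec2-register expects."""
    return f"{bucket}/{image_name}.manifest.xml"


command = ["ec2-register", manifest_location("empty", "emptyV1")]
print(" ".join(command))  # ec2-register empty/emptyV1.manifest.xml
```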

Like most of this stuff, much longer to explain than actually do!  Next time, implementing the Elastic Block Store (EBS) for the storage of SOLR index information.

Posted on 26th November 2008
Under: Developers' Journal | 23 Comments »

First Look at Amazon’s Elastic Compute Cloud (EC2)

The previous post described SOLR as a mechanism for implementing full text search on the Intellog website.  It is built on top of Lucene, and uses REST as an interface.  So now the effort shifts to finding a place to host SOLR, which is what Amazon’s Elastic Compute Cloud (EC2) is intended to provide.  In essence, it’s hot ‘n cold running Linux (and Windows) servers, or as Amazon describes it; "a web service that provides resizable compute capacity in the cloud."  Not only will EC2 be a good way to test out the SOLR solution, but will likely serve as a method of rolling out the functionality into a production environment, although there is a cash cost consideration with respect to the latter.

First things first.  Accessing EC2 requires an Amazon account, of course.  Aren’t these issued at birth these days?  But even if you already have one of those and you’re already using other Amazon web services, you still need to sign up for the EC2 service separately.  There’s no charge until you actually use the service (see Code Shavings, below).

Administration of EC2 can be done completely with REST-based web services.  Initially, however, it was much more straightforward to download and install the Firefox add-in called Elasticfox, which provides a graphical user interface to the EC2 infrastructure.  I chose to skip forward directly to the Elasticfox Getting Started Guide (GSG), which provides precisely the level of detail required to configure an initial instance of EC2, including four tutorials that cover the most important capabilities.  The guide is a little out of date, and omits some details, in spots;

  • Elasticfox accesses running server instances with a secure shell (ssh), yet the GSG does not mention that on Windows clients, installation and configuration of ssh software is required*.  Amazon recommends PuTTY, which can be obtained directly from the PuTTY website.  Reading and following the configuration details in the PuTTY-related appendix is also well worth it, and it’s easiest to get all of this out of the way prior to starting the GSG.
  • The Add Permission dialogue box, as described on the bottom of page 12**, and the top of page 13 of the GSG, has changed a little, no longer presenting a CIDR text box.

But with that, and a few other minor details, I still managed to get through Tutorial #1, which configures, launches, accesses and shuts down a Linux instance.  One potential downside to the EC2-based implementation of SOLR is it will be necessary to run, of course, 24 hours a day, seven days a week.  In effect, this establishes a minimum cost for operating this service, before the first user connects, of $0.10 per ‘instance-hour’, multiplied by 24 hours per day ($2.40), multiplied by 365 days per year, for a grand total of $876 per year.  Still pretty reasonable, particularly when it’s considered there is zero up front cost, and a great scalability story as well.
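The arithmetic is easy to check.  The Python below works in whole cents to sidestep floating-point rounding; the only input is the $0.10 per instance-hour rate quoted above, and the variable names are mine.

```python
RATE_CENTS_PER_INSTANCE_HOUR = 10  # $0.10, the rate quoted above

cents_per_day = RATE_CENTS_PER_INSTANCE_HOUR * 24
cents_per_year = cents_per_day * 365

print(f"per day:  ${cents_per_day / 100:.2f}")   # per day:  $2.40
print(f"per year: ${cents_per_year / 100:.2f}")  # per year: $876.00
```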

Code Shavings  An afternoon’s experimentation with EC2, which included launching a variety of different configurations as described above, cost a grand total of $0.52.  That’s fifty-two cents.  Compare that to what it would cost to purchase the hardware, acquire the software, install it all, not to mention all the time required, and you’ll figure you’ve hit the jackpot already.  I know I did.  ♦  There is a long list of Amazon Machine Images (AMIs) available.  The one that I chose as a place to start was ami-2e5fba47, which is further described as ec2-public-images/fedora-core4-apache-v1.07.

*This problem was likely precipitated by skipping over the preliminary material provided in the EC2 technical documentation.  You may not want to make the same mistake.

**Note: All references to page numbers are PDF page numbers, not the printed page number.  For example, page 12 of the PDF, if and when the document were ever printed, would actually be page eight.

Posted on 24th November 2008
Under: Developers' Journal | 1 Comment »

Ever Get The Feeling You’re Reinventing Something?

And furthermore, that whatever it was probably didn’t need it?  Unfortunately, for the second time in almost as many days, I found myself heading down a planned technological route, only to discover not only has it been travelled before, but the previous traveller had a better idea of where they were going, had arrived at the destination, and in all likelihood had already moved on to the next town.

So it was as I contemplated building a simplified, REST-based front end for Lucene.  This already exists, and it’s called SOLR* (pronounced ‘solar’).  In fairness to him (and me), Mocky had suggested this a couple of days ago.  At the time (and boy, does it seem like a lifetime ago) I was still digesting the REST concept, and its relationship to the specific technical objective on the table.  As I worked through that, it became increasingly obvious that all Googled roads were leading back to SOLR.

Some examination of the SOLR background material revealed it was precisely the concept I was intending to pursue, except more so.  In other words, not only would it provide a REST interface which would enable a connection between the inputSearchCriteria.php page and a Lucene-implemented index, but it would also provide the rest of the REST interface to Lucene.  So when the time inevitably comes when other aspects of the search engine need to be implemented, such as a user interface for document upload, the rest of the functionality is ready to go.  Therefore, the next job is to figure out the best and most effective way of hosting SOLR, and making the connection between it and the search criteria page.

Making these ‘discoveries’ is actually nothing but great news, and the earlier in the process the better.  The hours, days and maybe even weeks worth of development time that could have been sunk into a homebrew REST solution can now be dedicated to artifacts much closer to the customer experience of the Intellog application.  REST is still going to be used extensively in implementation efforts, so the time invested in it was well worth it.

*From the SOLR website; "Arguably, it stands for ‘Searching On Lucene w/Replication’ — but it should not be considered an acronym."

Posted on 21st November 2008
Under: Developers' Journal | 1 Comment »

RESTful Front End for Lucene: Bare Bones Functionality

Yesterday’s post described the setup of the Java development environment, based on NetBeans, with the subsequent objective of wrapping Lucene in a simple, REST-based webservice.

The 60 Second tutorial resulted in a ‘Hello World’ REST webservice, and for the time being, I decided to simply transplant the code from the Lucene-supplied sample into it.   What little difficulty was encountered was really based on a lack of familiarity with NetBeans and Java, rather than any significant technical problems.  However, the Lucene sample source code did seem a little out of date, making references to deprecated classes and members.  NetBeans provides fairly good feedback in this regard, however.  When suggested — by text presented with the strikethrough decoration — the Javadocs provide good information as to what the logical replacement for the deprecated member should be.  And –  horrors! — the default options for NetBeans force the implementation of try-catch blocks.  No putting off that little chore, any more.

One other minor stumbling block was getting the Javadocs to show up for the Lucene class library.  Lacking familiarity with creating new libraries, I figured it was best to simply add the library to the project using the Project Properties dialogue box.  But, for the life of me, I couldn’t get the Javadocs to show up.  Clued-in by an article by Sang Shin, I created a Lucene library with NetBeans’ Library Manager.  The latter provides a tab for Javadocs, which turned out to be what I was missing, and the Lucene Javadocs began to show up where expected.

By the end, though, the web service was sending the query string to Lucene, running the query, and then iterating over the results and sending them back to the browser.  While there is lots more work to do, this is the bare-bones-essential functionality required, and it was possible to implement it with remarkably little new code.  A couple of dozen lines, at most.  Next up, refinement, and then deployment to some sort of beta environment.

Code Shavings  Kelvin Tan at Supermind maintains a pretty good, if nascent, Lucene education website.  It’s found at LuceneTutorial.com, and is intended to "[j]umpstart…your Lucene knowledge".  ♦  Adding the Lucene library to the Java project was aided by an article found at the Nokia Forum and Chapter 5 of the JavaTech course.

Posted on 20th November 2008
Under: Developers' Journal | No Comments »

RESTful Front End for Lucene : Development Environment

In the previous post, the early testing of Lucene yielded great results: it was possible to accomplish in a morning what had not yet been accomplished in the preceding week of effort.  So it looks like this is the road to go down to implement full text search on the Intellog website, starting with the ERCB’s ST1 and ST49 reports.  The initial goal of the implementation is to build a webservice, based on REST, which accepts Lucene query syntax, and returns results in XML.  The XML will then be transformed with XSLT into renderable HTML, which will subsequently be styled with CSS.  This should provide for seamless integration into the look-n-feel of the Intellog website.

The title, and the promise, of the tutorial RESTful Web Service in 60 Seconds was just too good to pass up, and it served as a guide for getting bootstrapped into the Java world.  It called for the installation of NetBeans and the Java JDK (see Code Shavings for specific versions), which was done.  After that, the tutorial recommends some reading from the Sun website (specific links provided below, as well) regarding REST in Java in general.  Then, the tutorial was followed verbatim, and it delivered what it promised, or pretty close.  In a few minutes, there was a working REST webservice of the ‘HelloWorld’ variety.  It’s not often these tutorials deliver on their promise, but this one did.  Thanks, Meera.

The setup of the development environment, and creating the initial example, really couldn’t have gone more smoothly, so it was on to the sample which comes along with Lucene.  There is an ‘all-in-one’ type thing which provides an illustration of how to access Lucene programmatically from within Java.  As they say in the example, its style and substance is nothing to write home about, but it’s definitely the example that was needed to get going.

Code Shavings   The tutorial noted above recommends a couple of articles from the Sun website, and they are well worth reading; RESTful Web Services, and Implementing RESTful Web Services in Java.  ♦  The specific MSIs for NetBeans and the JDK were netbeans-6.1-ml-windows.msi* and  java_ee_sdk-5_06-windows.msi, and both versions were downloaded in their fully optioned configuration.  ♦  GlassFish, the Sun application server, is V2 UR2, but it came along with the JDK and was not a separate install.  ♦  The governing document for REST in Java is JSR 311.

*Although if you navigate to the 6.1 page now, it automatically takes you to the 6.5 page.  It would appear as though 6.5 has gone from release candidate status to production.

Posted on 19th November 2008
Under: Developers' Journal | 2 Comments »

Lucene (or, ‘I Think Mocky May Have Been Right’)

In a recent discussion on the SimpleDB discussion forum, Mocky Habeeb provided an insightful response to my question about coding a SELECT DISTINCT emulation, in pursuit of full-text search functionality.  To quote him; "[f]ull text search has been solved for a long time and sdb really does nothing to make it a prime choice."  I had a hard time shaking off his statement, regardless of the fact there were already a number of days invested in a SimpleDB-based indexing solution.  A couple of hours on Google over the weekend, and there were a couple of open source candidates for search engines worthy of some investigation before slinging more time at a home brew solution.

Lucene and Xapian percolated to the top of the pile.  Lucene is tied in with the whole Apache-plex, and is written entirely in Java, whereas Xapian is a solution based on C++, and seems to have a bit of patchwork history, having been passed through a number of hands on its journey to its current state.  Both appear very capable and have decent install bases.  Given the remote possibility of having to extend the search engine functionality at some point in the future, though, Java seems a whole lot more accessible than C++*.

Download of Lucene 2.4.0 was very straightforward, and having gotten over a transient case of dyslexia with respect to the Java CLASSPATH, the command-line demo of Lucene fired right up.  I was able to index all of the accumulated ST1s and ST49s in a couple of minutes, and the resulting index files were on the order of 25 MB in size for the roughly 5500 documents.  The documents themselves are about 150 MB, making the index roughly one-sixth (about 17%) of the original size.  Presently, I was able to issue Lucene query syntax against the indexed data, and get exactly the results for which I was hoping.  Even the AND logic, which had proved so troublesome and consumed so much time recently, was up and running perfectly.  And all of this by noon.  While it’s hard to walk away from the time invested in the SimpleDB index, Lucene seems to have all the necessary full text search capability and it’s already built.

Attention then turned to establishing some sort of interface between the PHP front end, and the Java-based Lucene back end.  A solution appeared to be within grasp with the PHP Java Bridge, but once this was installed, along with Apache Tomcat 6.0.18 to host it, there was some sort of problem between the bridge and PHP resulting in a steady stream of "CGI / FastCGI has encountered a problem and needs to close" errors showing up.  Some investigation makes it appear as though it is a known, open bug with PHP.  So the ‘bridge is closed’ for the time being.

But then I began to think about the future, production deployment of the search engine.  Eventually (maybe sooner, rather than later) it will find its way onto its own server.  Maybe even onto multiple servers (on EC2?) in order to scale it up.  So, why not simply build a RESTful front-end for Lucene, which can be built in Java as well, and avoid the whole PHP-to-Java connection entirely?  In theory, this can be very simple; something to listen for incoming Lucene query syntax, and respond with an XML document containing the search results.  The XML can then be styled in a manner very similar to a response coming back from SimpleDB, making the time invested in the XSLT worthwhile.

*That, coupled with a chance mention of Lucene in Scott Rosenberg’s Dreaming in Code.

Posted on 18th November 2008
Under: Developers' Journal | 5 Comments »

XML Lesson Learned: Sometimes Whitespace Does Count

It seems like the last week has been almost completely dominated by a solitary figure hunched over XSLT code with Notepad++.  Progress has been seemingly glacial, but consolation was taken from the fact XSLT was being bent in some ways which were previously unthinkable.  In other words, the learning curve was steep, but ultimately fairly productive.  Or in other, other words, I felt like I was learning a lot.  So it was time for a ‘break’ from the tough stuff to extract mail merge information from an XML with contact data.  It was nothing more complicated than looking for a tag populated with some specific text, and when that text was found, extract name, address and the like.  But just like the funny tripping man statue in the previous post, I was sent sprawling to the ground, having been caught by the code equivalent of a stair step just out of the allowable 1/4" tolerance.

The short version of this post is; with XML, whitespace sometimes does count, which was contrary to my thinking to date, in which whitespace was assumed away.  To that end, when I set up a recursive taxonomy structure in XML to classify contact information, I thought it would make perfect sense to have something like the following;

<Taxonomy>
    <lbl>
        Intellog
        <Taxonomy>
            <lbl>Customer</lbl>
        </Taxonomy>
    </lbl>
</Taxonomy>

Contact information was being edited directly, so the indenting was used to keep the hierarchy of taxonomies organized and easily understood by a human reader.  lbl was a child element of Taxonomy, so indent it.  Intellog was a text node child of the lbl element, and the nested Taxonomy element was another child, so indent the two of those once more.  When the time came to extract customers who were categorized in a particular way, I assumed that the XPath query;

/Taxonomy/lbl[. = 'Intellog']

would return all the Taxonomy elements similar to the example shown above.  But it didn’t.  Not even close.  The first hint that all was not what it first appeared to be was changing the syntax to;

/Taxonomy/lbl[contains(., 'Intellog')]

which finally began to return the expected nodes.  But this still wasn’t satisfactory, because of the imprecise nature of the match.  Any label that merely contained Intellog as a substring would also be returned, for example.  Then, to try and understand what was happening, the syntax;

/Taxonomy/lbl[string-length('Intellog')]

was coded, which returned the value 18.  WTF?  At first, it was assumed this was double the number of characters in the string, so I assumed it was a Unicode-related thing.  It only took about 40 minutes to realize that Intellog was eight letters, not nine, which would have thrown that theory out the window.  Anywho…the syntax which finally worked was;

/Taxonomy/lbl[normalize-space(.) = 'Intellog']

This led to the conclusion the first XPath syntax, based on equals ('='), was including the tabs and linefeeds in the comparison.  The normalize-space() function strips leading and trailing whitespace and collapses internal runs of whitespace into single spaces, so the comparison works the way you would expect.
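The effect can be reproduced outside XSLT.  The following is a minimal sketch using Python's standard library, with the sample taxonomy from above; the normalize_space() helper is a hand-written stand-in for the XPath function (Python's split() treats a few more characters as whitespace than XPath does, which makes no difference for this sample).

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Taxonomy>
    <lbl>
        Intellog
        <Taxonomy>
            <lbl>Customer</lbl>
        </Taxonomy>
    </lbl>
</Taxonomy>"""


def normalize_space(text):
    """Approximate XPath normalize-space(): trim both ends, collapse inner runs."""
    return " ".join(text.split())


root = ET.fromstring(SAMPLE)
lbl = root.find("lbl")                    # the outer lbl element

first_text = lbl.text                     # text node before the nested Taxonomy
string_value = "".join(lbl.itertext())    # all descendant text, as XPath's "." sees it

print(repr(first_text))                             # shows the newlines and indentation
print(first_text == "Intellog")                     # False
print(normalize_space(first_text) == "Intellog")    # True
print(normalize_space(string_value))                # Intellog Customer
```

Two things are worth noting.  First, in XPath the string value of an element includes the text of all its descendants, so for the sample exactly as shown here, the normalized value of the outer lbl also picks up the nested Customer label; the data actually queried may have been shaped differently.  Second, string-length() applied to the literal 'Intellog' can only ever be 8, so a result of 18 presumably came from measuring the node's text, indentation and linefeeds included.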

Who knew.

Posted on 16th November 2008
Under: Developers' Journal | No Comments »

Don’t You Just Hate It When That Happens?

Having just put all the pieces together to enable full text searches powered by an index hosted on SimpleDB, a wrench was thrown into the works.  Keywords linked by logical ORs worked just fine, but surprisingly, terms linked by logical ANDs did not; they returned nothing.  Given only a subset of the index data had been loaded for testing purposes, it was assumed there really were no documents containing the two terms linked by the AND.  However, after carefully ensuring there should be one ‘hit’ for a given document based on two search terms, there were still no results.  Then it became evident there was a slight misunderstanding about how INTERSECTION worked in the SimpleDB query syntax.  Don’t you just hate it when that happens?  The ‘revised’ understanding was confirmed by a post-and-response to the SimpleDB forum.

Parenthetically speaking, Mocky (the SimpleDB guru at AWS) did raise some concerns about the efficiency of the method used to store index information.  In response, another test was set up to combine the identifier and its associated uid into one attribute, separated by a tilde (~) symbol, the combination of which was collectively named keyTxt.  Upload performance wasn’t a great deal different (it was still being done on the crappy wireless LAN noted previously), but this approach will, of course, have a dramatic impact on the number of name-value pairs that will have to be stored in order to store the full index.  Note: In the diagram to the right, which documents the upload performance, don’t let the ‘S3’ be confusing; it has nothing to do with Amazon’s Simple Storage Service (S3), but rather, it’s simply the standard adopted for distinctively naming index upload sessions.

However, that’s a digression.  The job became one of emulating the AND functionality on the result set coming back from SimpleDB.  It seemed like a relatively simple problem to solve: if there were two search terms linked with an AND, produce the unique uids of the documents which contain both terms.  Solving this problem was made simpler by knowing that when the index is produced, each token or identifier appearing in a given document is only stored once.  So you construct the SimpleDB query based on an OR, separate the uid from its associated keyword by splitting them on the tilde symbol, and then count the number of times the uid appears in the result set.  If it appears a number of times equal to the number of terms linked by the AND, then the document identified with the uid can be assumed to contain all of the ANDed search terms.
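The counting logic can be sketched in a few lines.  This is an illustration in Python, not the code actually used; it assumes each keyTxt value has the form keyword~uid, and the sample keywords and uids are made up.

```python
from collections import Counter


def and_filter(key_txts, terms):
    """Emulate AND over the rows returned by an OR query.

    key_txts: strings assumed to look like 'keyword~uid'.
    terms:    the search terms that were linked by AND.
    Relies on each keyword being stored at most once per document.
    """
    wanted = set(terms)
    counts = Counter()
    for key_txt in key_txts:
        keyword, _, uid = key_txt.partition("~")   # split on the first tilde
        if keyword in wanted:
            counts[uid] += 1
    # A uid seen once per term must contain every term.
    return sorted(uid for uid, n in counts.items() if n == len(wanted))


# Hypothetical rows, as if returned by an OR query on 'well' and 'pressure'
rows = ["well~doc1", "pressure~doc1", "well~doc2", "pressure~doc3"]
print(and_filter(rows, ["well", "pressure"]))   # ['doc1']
```

The approach only holds because the index stores each token once per document; if duplicates were possible, the count would have to be taken over distinct keyword and uid pairs instead.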

Believe it or not, the tool initially chosen to do the analysis of the result set described above was XSLT.  This was based on the assumption it was going to be in the picture, anyway, to transform the SimpleDB result set into HTML for presentation purposes.  So doing the equivalent of a relational SELECT DISTINCT cannot be all that difficult, right?  Oh, how wrong that assumption turned out to be.  After four days of struggle, that particular problem was still not yet solved, which makes one think XSLT is not likely the right tool for this particular type of task.  There was a strong temptation to keep hammering away at it, but with a variety of deadlines (and borderline stress-induced psychosis) looming, it was time to cry uncle, and move on.  The solution was to use two XSLTs; the first to strip out and sort the uids returned in the AWS result set, and the second to merge the uids with other information, ready for final presentation on the browser.  The missing link between the two is a few lines of PHP code which goes through the stripped out uids, and counts repeating groups.  Somewhat ugly, but it works.
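For the curious, the ‘missing link’ step amounts to counting runs in a sorted list.  The sketch below shows the idea in Python rather than PHP; the input list stands in for the uids stripped out and sorted by the first XSLT, and the values are hypothetical.

```python
from itertools import groupby


def uids_in_all_terms(sorted_uids, term_count):
    """Keep each uid whose run length in the sorted list equals term_count."""
    return [
        uid
        for uid, run in groupby(sorted_uids)     # groups adjacent equal uids
        if sum(1 for _ in run) == term_count
    ]


stripped = ["doc1", "doc1", "doc2", "doc3"]      # hypothetical output of the first XSLT
print(uids_in_all_terms(stripped, 2))            # ['doc1']
```

The sort done by the first XSLT matters here, since groupby() only merges adjacent equal values.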

Code Shavings  When I first tackled the ‘grouping’ problem described above, it appeared as though XSLT 2.0 had some great features which would have handled it in a snap, as described by Bob DuCharme on xml.com.  Given the article was written over five years ago, I assumed XSLT 2.0 was widely available.  However, it looks like XSLT 2.0 may have got stuck in a VHS-vs-Betamax death spiral with XQuery.  ♦  Another possibility for the grouping problem was nodeset(), described by Jirka Kosek, once again on xml.com.  This article is also dated 2003, which seems a little strange, to me, begging the question ‘whatever happened to XSLT 2.0?’  ♦  Thanks to Melonfire on TechRepublic, for a great article on manipulating XML with PHP.

Posted on 14th November 2008
Under: Developers' Journal | 1 Comment »