XSL to Achieve Relational(-like) JOINs
The plan it to use Amazon’s SimpleDB to host the millions of rows that will result from the application of the concordance ideas described previously. Keywords gathered on the form inputSearchCriteria.php will be handed to SimpleDB, and it will hand back the Document.uid for the documents in which those keywords appear. But what of the other information the user is going to want to know, other than the cryptic uid? The URL where the document can be found is a good example of what they’ll want to, but also, the source of the information, perhaps the author, when it was modified last — the list goes on an on. SimpleDB does not claim to be a relational database, and would simply recommend bundling these additional values into the domain* with the uid and modify the query to return not only the uid, but the other values the users want to see when they hit a document.
There are a couple of problems with this approach. First, there is the potential for it to be costly — well, more accurately, not ‘cost free’. SimpleDB is a pay as you go system, so if you have to store the verbose URL 4000 times because there are 4000 keywords indexed in the related document, you are, in effect, paying for each character of each redundant copy of the URL to be stored. The amount paid per character is extremely small, but not zero. However, a bigger issue would be longer term maintenance of the SimpleDB-resident keywords. If at some point in the future, a new piece of information is associated with each indexed document, you potentially have to go back and touch each record in the SimpleDB domain — this could be millions of records. So, a much better approach would be to maintain a lookup table from which the other values can be retrieved based on the uid handed back from SimpleDB. Fairly traditional relational thinking, actually.
Wanting to stay away from an instance of MySQL for the time being for a variety of reasons, the choice was made to investigate whetheran XSL could be used to take the uid and retrieve values from an XML-based version of document library. Much to my surprise, and pleasure, it can. What’s more it would appear as the the xsl:key syntax even makes it an indexed lookup. The XML immediately below is an analogue for the result set coming back from SimpleDB;
<?xml version='1.0' encoding='UTF-8'?> <?xml:stylesheet type='text/xsl' href='test.xsl'?> <Result> <Hit> <uid>FFF4A3CB-BCF2-485F-942A-25200BA7AAB1</uid> </Hit> <Hit> <uid>8730192F-B35F-422B-9EA5-5FEC5ED491CE</uid> </Hit> <Hit> <uid>0008BA12-B476-4F42-8B11-7584271A2843</uid> </Hit> <Hit> <uid>906A68F6-7F5F-4A19-A026-A65CE8D21194</uid> </Hit> <Hit> <uid>906EB740-C62C-414E-BF57-49B9FD09B053</uid> </Hit> <Hit> <uid>906ED85B-5FFF-4474-861A-9417D3C6547D</uid> </Hit> </Result>
And the following XML is a fragment of the table from which values will be looked up, in this case the urlTxt element.
<Library> <Document> <uid>0008BA12-B476-4F42-8B11-7584271A2843</uid> <urlTxt>http://E.intellog.com/data/from/alberta/st1/2007/08/23.txt</urlTxt> </Document> <Document> <uid>000DA073-1E71-41C4-BD9A-61F063DF1497</uid> <urlTxt>http://E.intellog.com/data/from/alberta/st1/2001/07/26.txt</urlTxt> </Document> <Document> <uid>0013F4EF-333C-4E38-9BD2-10080307CD64</uid> <urlTxt>http://E.intellog.com/data/from/alberta/st49/2002/09/03.txt</urlTxt> </Document> <Document> <uid>0014F12E-BD48-474A-9FB9-14ABC0BF9808</uid> <urlTxt>http://E.intellog.com/data/from/alberta/st1/2002/02/27.txt</urlTxt> </Document> <Document> <uid>002440D7-E977-48F9-A031-D39135DE48CC</uid> <urlTxt>http://E.intellog.com/data/from/alberta/st1/2003/02/28.txt</urlTxt> </Document> <Document> <uid>0055402B-551F-41F6-9674-839F0FC5ADBB</uid> <urlTxt>http://E.intellog.com/data/from/alberta/st49/2008/05/19.txt</urlTxt> </Document> <Document> <uid>005B9E6A-AFC4-4532-B1CA-EB5D76715EA9</uid> <urlTxt>http://E.intellog.com/data/from/alberta/st1/2003/01/05.txt</urlTxt> </Document> . . . </Library
And then the following XSL;
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform
version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
<xsl:output
method="text"
omit-xml-declaration="yes"
/>
<xsl:key name="uid_Document" match="Document" use="uid"/>
<xsl:variable name="uid" select='document("Library.xml")/Library'/>
<xsl:template match="Hit">
<xsl:text>The uid </xsl:text>
<xsl:value-of select="uid"/>
<xsl:text> represents the URL </xsl:text>
<xsl:apply-templates select="$uid">
<xsl:with-param name="currentHit" select="."/>
</xsl:apply-templates>
</xsl:template>
<xsl:template match="Library">
<xsl:param name="currentHit"/>
<xsl:value-of select="key('uid_Document', $currentHit/uid)/urlTxt"/>
<xsl:text>. </xsl:text>
</xsl:template>
</xsl:transform>
returns a result set consisting of the uid from the first first XML above, followed by the urlTxt from the second XML. Currently, there are about 5000+ Document elements in the lookup table XML, and to this point, the performance of the join appears to be quite good.
Code Shavings Much thanks and much of the credit goes to Uche Ogbuji at Fourthought for his great article on IBM’s developerWorks website. It provided the technical insight, and a clear example which served as a template upon which my work was heavily based.
*This is SimpleDB terminology that is roughly analogous to the table concept in a relational database.
Posted on 28th October 2008
Under: Developers' Journal | 1 Comment »
