Archive for September, 2008

Intellog Application Organization

The diagram to the left illustrates the overall conceptual organization of Intellog applications.  It is based on the assumption there will be multiple applications in the Intellog product suite (shown as the symbols numbered 1 through 6), and these applications will eventually serve multiple markets (shown as the rectangles A through C).  A given market may have need for all the applications, or a subset thereof.

Applications can be described in terms of their properties*, each of which will fall into one of the following categories:

Application Consistent Properties (ACP)  The properties of a given application which are consistent, regardless of the market being served by the application.  Example:  In the case of a search-type application, the search terms will always be entered in a simple, Google-like box in the middle of the form.

Market Consistent Properties (MCP)  The properties of the application suite which are consistent within a given market.  Example: For a market which manages a lot of information geographically, specific buttons would be dedicated to selecting information by lat/long, and they would be consistently implemented for all the applications for that particular market. 

Universally Consistent Properties (UCP)  The properties of Intellog applications which are consistent across the entire application suite, regardless of market.  More than any other type of property, these will establish the ‘brand’ of Intellog applications.  Example:  The Intellog home page, which illustrates fixed header and footer, with a scrolling region in between them.

Application Unique Properties (AUP)  A property which is truly unique to a specific implementation of an application within a given market — which is to say any ‘intersection’ of the market/application matrix as shown in the diagram.  Example:  A button to file a well license would be available within a single application for the specific market to which this license is related.
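One way to read the four categories is as the answers to two yes/no questions: is the property the same in every market, and is it the same in every application?  The sketch below is only an illustration of that reading; the function name and its two parameters are made up for the example and are not part of any Intellog code.

```python
def classify_property(same_across_markets: bool, same_across_applications: bool) -> str:
    """Return the category code (UCP, ACP, MCP or AUP) for an application property."""
    if same_across_markets and same_across_applications:
        return "UCP"  # consistent everywhere: part of the Intellog 'brand'
    if same_across_markets:
        return "ACP"  # follows the application, whatever the market
    if same_across_applications:
        return "MCP"  # follows the market, whatever the application
    return "AUP"      # one cell of the market/application matrix


# The four examples from the text, expressed as answers to the two questions.
examples = {
    "Google-like search box": classify_property(True, False),
    "lat/long selection buttons": classify_property(False, True),
    "fixed header and footer": classify_property(True, True),
    "file a well license button": classify_property(False, False),
}

for name, category in examples.items():
    print(f"{category}  {name}")
```

Framed this way, the matrix in the diagram has no gaps: every property lands in exactly one of the four categories.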

It’s assumed there will be a relatively high probability of customers within a given market using more than one Intellog application.  At the same time, there is a relatively low probability a given customer will use a given application in two different markets.  Example: If a given customer is using the Intellog energy-industry version of the search application, there is a pretty good chance that customer will also use the Intellog energy-industry data exchange application.  The chances of that same customer concurrently using the Intellog search application for the automotive industry are fairly low.

As it relates to the above, any implementation inconsistency within a given market is going to impact customer experience quickly, whereas inconsistency between industries is not likely to impact any given customer for a considerable period of time.  As a result, the category of a given application property will have a direct bearing on technical implementation, so the characteristics of the property can be efficiently managed.  If a trade-off must be made, it is more important for MCP to be managed from one central location than ACP.  Example: Colours and fonts would be managed from a single CSS file for a given market.  Change the one CSS file, and it instantly changes the colours and fonts for all the applications for that market.  Conversely, the PHP code implementing the search application may have to be copied from one application directory to another.  As a result, it is acceptable for it to take some time for these latter changes to propagate out.

*The term property in this context should not be confused with the like-named object-oriented terminology.  A property is simply a characteristic or feature of a given application.

Posted on 24th September 2008
Under: Developers' Journal | 1 Comment »

More XSL Transformation of SimpleDB Responses

Last time, it was casually mentioned there was a problem getting the XML to transform properly, and the problem seemed to be related to the xmlns found in the XML and XSL.  Turns out the description of the problem was erroneous; in order for the transformation to actually occur, it was necessary to remove the xmlns from the XML returned by SimpleDB.

The first thought was manipulation of the SimpleDB response XML while it was instantiated as a DOMDocument.  When it turned out removing the xmlns didn’t seem possible using this method*, to say nothing of the technical implications of removing it, another search provided the breakthrough insight.  The problem was the lack of a unique namespace reference in the XSL.  By adding the xmlns:aws line to the XSL fragment shown below, and then prefixing the query strings in the subsequent XSL code with aws, all was well:

<xsl:stylesheet
	version="1.0"
	xmlns="http://www.w3.org/1999/xhtml"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:aws="http://sdb.amazonaws.com/doc/2007-11-07/"
	type="text/xsl"
	>

In other words, the queries in the balance of the XSL code were missing their targets because the query <xsl:template match="/QueryWithAttributesResponse"> did not match any node.  The actual name of the node is namespace-qualified, so the query which matches it is <xsl:template match="/aws:QueryWithAttributesResponse">.  So in the end it appears as though it’s intended behaviour.  But easy to miss, to be sure.
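The same effect can be reproduced outside XSLT.  The sketch below is illustrative only: it uses Python's standard xml.etree.ElementTree rather than the PHP XSLTProcessor used for the actual work, and the sample response is a hand-written, trimmed-down assumption about the shape of a SimpleDB reply (the item name and URL are made up).  An unprefixed path finds nothing because the elements live in the Amazon namespace; binding a prefix to that namespace makes the same path match.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a SimpleDB response (an assumption, not captured output).
SAMPLE = """<QueryWithAttributesResponse xmlns="http://sdb.amazonaws.com/doc/2007-11-07/">
  <QueryWithAttributesResult>
    <Item>
      <Name>doc-1</Name>
      <Attribute><Name>urlTxt</Name><Value>http://example.com/a</Value></Attribute>
    </Item>
  </QueryWithAttributesResult>
</QueryWithAttributesResponse>"""

# The prefix itself is arbitrary; only the namespace URI has to agree with the XML.
NS = {"aws": "http://sdb.amazonaws.com/doc/2007-11-07/"}

root = ET.fromstring(SAMPLE)

# Without a prefix the path looks for an Item in *no* namespace, so nothing matches.
unprefixed = root.findall(".//Item")

# With the prefix bound to the Amazon namespace, the same path finds the item.
prefixed = root.findall(".//aws:Item", NS)

print(len(unprefixed), len(prefixed))
```

The default xmlns on the response applies to every element beneath it, which is why every step of a query needs the prefix, not just the first one.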

Also, capturing one of the SimpleDB response XMLs and saving it on local storage is a way of saving some money, given each call to SimpleDB incurs a small charge.  The code which is doing the server-side transformation (putSearch.php) was set up to either read the locally-saved XML, or make a live call out to the sample data on SimpleDB.  Of course, the locally-saved XML doesn’t change in response to the parameters it is passed, but it’s still useful for a variety of different types of tests.
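The saved-or-live switch is simple enough to sketch.  What follows is not the code in putSearch.php; it is a Python illustration of the same idea, with a made-up function name and a fake stand-in for the live SimpleDB call so the example runs without credentials or charges.

```python
import tempfile
from pathlib import Path
from typing import Callable


def get_response_xml(use_saved: bool, saved_path: Path,
                     fetch_live: Callable[[], str]) -> str:
    """Return the saved response when use_saved is set, otherwise call out live."""
    if use_saved:
        # No charge: the same captured XML comes back whatever was searched for.
        return saved_path.read_text(encoding="utf-8")
    # Each live call is billed, so tests should take this path sparingly.
    return fetch_live()


# Stand-in for the real SimpleDB request (hypothetical); it records each call made.
live_calls = []


def fake_live_call() -> str:
    live_calls.append(1)
    return "<QueryWithAttributesResponse/>"


with tempfile.TemporaryDirectory() as folder:
    saved = Path(folder) / "response.xml"
    # Capture one live response to local storage...
    saved.write_text(get_response_xml(False, saved, fake_live_call), encoding="utf-8")
    # ...then replay it as often as needed without any further live calls.
    replayed = [get_response_xml(True, saved, fake_live_call) for _ in range(3)]

print(len(live_calls), "live call;", len(replayed), "replays")
```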

Code Shavings  Thanks to httpwebwitch for providing the breakthrough insight on the problem I was having with the XSL file.

*Although I did learn quite a bit about manipulation of XML using PHP 5 as it turns out, which seems pretty slick.

Posted on 24th September 2008
Under: Developers' Journal | No Comments »

Using XSL Transformation to Format a SimpleDB Response

The previous post described a subset of _DocumentToken being successfully loaded into SimpleDB using SimpleDB Explorer (SDBX).   The next step was to build a simple web-based query form which enabled the user to enter a search term, have the search term passed to SimpleDB for processing, and then format the results into something presentable.  Collecting a parameter from a field on a form is relatively trivial, and passing it in a URL is not much more difficult.  Therefore, the real challenge of this work was the handling of the response, and formatting it for appearance.  This was accomplished using an XSL transformation (XSLT).

The initial question was whether the transformation should occur client-side, or server-side.  Testing some simple examples made it seem as though client-side transformations have been consistently implemented across the most popular browsers.  Therefore, client-side appeared to be the way to go, since there’s just that much less for the server to do.  All it requires is pointing to the XSL from within the SimpleDB result XML, with a line such as the following:

<?xml-stylesheet href="stylesheet.xsl" type="text/xsl" ?>

However, what wasn’t known until this point was client-side transformations result in the pre-transformation XML being displayed when viewing the source of the rendered page.  Does the viewing audience really need to see this?  Alternatively, if the transformation is occurring server-side, the only thing visible on the browser is the HTML code which results from the transformation.  Therefore,  it seemed server-side transformation was more robust, because it only demands the browser be able to faithfully render HTML code — something all of the major browsers have been doing for donkey’s years.  So, in a face-off between performance (render on the client) and robustness (render on the server), robustness wins, and hence the rendering will be done server-side.

To ensure the basic plumbing of server-side transformation was working, a half-dozen-or-so lines of code were used for a simple XML transformation using an XSL.  Turns out the XSL processor, while standard on PHP 5, was not configured by default.  So it was necessary to re-download php-5.2.6-win32-installer and then use Control Panel > Add/Remove Programs > Change to configure PHP to provide access to the XSL logic, and in particular, the XSLTProcessor class.  A quick restart of Apache after the change and all was well.  The other minor detour was getting the XML and XSL to actually interact with each other as expected.  After an hour or two of struggle and dead-ends, the problem turned out to be that the line

xmlns="http://sdb.amazonaws.com/doc/2007-11-07/"

needs to be returned along with the rest of the SimpleDB response, and also needs to be included in the XSL, where it replaces the line

xmlns="http://www.w3.org/1999/xhtml"

At this time, it’s not entirely clear why this is the case, but suffice it to say it can easily be added to each XSL for now.  Of course with all this working, attention turned to the XSL itself.  The article mentioned above was a good way to get re-acquainted with the basic subject.  Like most things XML-related, it’s not difficult, just tedious and syntactically finicky.  Working from samples found with a bit of googling, it was eventually possible to come up with an XSL which transformed the SimpleDB response into a fairly nicely formatted set of URLs.  The only remaining issue is to send the SimpleDB response straight to the XSL transformation.

Code Shavings  The document root used by the local version of Apache was changed to point to the local version of the Intellog folder hierarchy, so the application can be developed locally, and then moved over to the ‘parallel universe’ of the production server.  ♦   The SimpleDB library downloaded from Amazon had to be located in the …/PHP/includes directory.  A similar change will eventually have to be made to the production server.  ♦  But it turns out my good friends over at the ISP are not able to load the SimpleDB library onto a shared server, so it will be necessary to either locate this library in the application directory, or move up to some sort of dedicated server.  ♦  Thanks to ModMySite  for the clue on the problem with the missing XSL transformation.

Posted on 23rd September 2008
Under: Developers' Journal | 1 Comment »

SimpleDB Explorer Uploads and Some SimpleDB Limitations

Turns out the problem with SimpleDB Explorer (SDBX), described previously, has been confirmed to exist in the product.  But Chambal is willing to make the changes recommended in a post to their forum.  That’s the good news.  The less good news is there is a limitation to the number of name-value pairs for a given SimpleDB item, which Chambal pointed out as 1024.  They were wrong — it’s actually 256.  Yikes!  This necessitated a workaround so the large number of key-value pairs required to store identifiers and tokens per document can be accommodated. 

Therefore, the original idea of calling the domain Document no longer makes sense.  There will be two domains; _DocumentToken and _DocumentIdentity.  There is a limitation of 250,000,000 name-value pairs for each domain.  That sounds like a lot, but in the context of the concordance application, with just a shade over 5000 relatively short documents parsed, there are already 560,000+ unique combinations of identity and documents, and a whopping 2.2 million unique combinations of tokens and documents.  Multiply each of these by three value pairs per item, and you’ve already chewed up roughly 0.7% (1.7 / 250) and 2.6% (6.6 / 250) of the current name-value pair limit.  That’s uncomfortable, at best.  However, the plan is to press on, regardless, given the benefit of getting something scalable going quickly.  Also, this issue can be further mitigated by one or more of the following:

  • The types of words deemed not useful as tokens can be increased, and eliminated from the concordance.  Single letters have already been eliminated, and there are likely lots of other character combinations which can suffer a similar fate.
  • Adopt a scheme for packing multiple tokens or identifiers into a single item, such as separating them with a ~ (tilde) or some such thing.  However, this may just move the limitation problem to the total size of the domain, or some other limit.
  • Compress the identifiers and tokens using some sort of binary representation.
  • Restrict the population of _DocumentIdentity to those identifiers which have a geographic location associated with them.
  • The keyword search could eventually be implemented in MySQL, running on the EC2 platform.  This could be a less expensive way of doing it, as well.
  • This is to say nothing of the possibility of Amazon raising or eliminating the 250,000,000 limit, which does have an arbitrary ring to it.  Alternatively, Amazon might introduce keyword search to SimpleDB or S3, which would completely eliminate the need to implement this functionality as described above.
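The name-value pair budget discussed before this list can be checked with a few lines of arithmetic.  The sketch below uses only the rounded figures quoted in this post (250,000,000 pairs per domain, three pairs per item, 560,000 and 2.2 million combinations, about 5000 documents); the function and variable names are made up for the example, and the projection at the end assumes usage grows in proportion to the number of documents, which is a rough guess.

```python
DOMAIN_LIMIT = 250_000_000  # name-value pairs per domain, as quoted in the post
PAIRS_PER_ITEM = 3          # "three value pairs per item"
DOCUMENTS_PARSED = 5_000    # "just a shade over 5000" documents


def share_of_limit(combinations: int) -> float:
    """Percentage of the per-domain limit used by this many items."""
    return combinations * PAIRS_PER_ITEM / DOMAIN_LIMIT * 100


identity_share = share_of_limit(560_000)   # identity/document combinations
token_share = share_of_limit(2_200_000)    # token/document combinations

# Rough projection: how many documents before the token domain is full,
# assuming the pairs-per-document rate stays the same (an assumption).
documents_at_limit = DOCUMENTS_PARSED / (token_share / 100)

print(f"identities: {identity_share:.2f}% of the limit")
print(f"tokens:     {token_share:.2f}% of the limit")
print(f"token domain full at roughly {documents_at_limit:,.0f} documents")
```

On these figures the token domain would fill up somewhere short of 200,000 documents, which is why the mitigations listed above matter.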

SDBX only supports uploads from MySQL, so the required table was implemented by creating the base.DocumentToken and base.DocumentIdentifier VIEWs in the SQL Server version of the Intellog database.  These views are then used to populate the similarly structured base._DocumentToken and base._DocumentIdentifier denormalized tables.  The logic to populate these tables was captured in the newly-created base.refresh_DocumentToken and base.refresh_DocumentIdentifier stored procedures, also in SQL Server.  This approach enables set math between the VIEWs and the tables, which will subsequently be used to generate lists of obsolete tokens which can be eliminated from the _DocumentToken and _DocumentIdentifier domains on SimpleDB.
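The set math mentioned above boils down to two set differences.  The sketch below shows the idea in Python with made-up rows; the real comparison would be done in SQL between the VIEWs and the denormalized tables, and the variable names here are purely illustrative.

```python
# Hypothetical (token, document URL) rows; not real Intellog data.
# The VIEW reflects the current state of the source database.
view_rows = {
    ("well", "http://example.com/a"),
    ("licence", "http://example.com/a"),
    ("pipeline", "http://example.com/c"),
}

# The denormalized table reflects what was loaded (and uploaded) last time.
table_rows = {
    ("well", "http://example.com/a"),
    ("licence", "http://example.com/a"),
    ("rig", "http://example.com/b"),
}

# In the table but no longer in the view: candidates for deletion from SimpleDB.
obsolete = table_rows - view_rows

# In the view but not yet in the table: rows still to be loaded.
new_rows = view_rows - table_rows

print("obsolete:", sorted(obsolete))
print("new:     ", sorted(new_rows))
```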

To actually move the data from SQL Server to MySQL, a query was prepared which generated executable SQL off of _DocumentToken.  Performance absolutely stunk; there’s no nicer way of putting it.  However, it was subsequently discovered that nobody in their right mind does it this way, but rather the LOAD DATA syntax is used for bulk loads from text.  Aided by the MySQL documentation in this regard, I eventually worked out the syntax for the statement as follows:

LOAD DATA INFILE '...\\_DocumentToken.txt' INTO TABLE base._DocumentToken
	FIELDS TERMINATED BY ','
	LINES STARTING BY '~' TERMINATED BY '\r\n';

But that’s only half the story — something needs to create the _DocumentToken.txt file the LOAD DATA statement ingests.  It was generated with the following SQL on the SQL Server side;

SELECT
	'~' + REPLACE(_DocumentToken.txt, ',', '') +
	',' + _DocumentToken.urlTxt +
	',' + CAST(uid_contains AS VARCHAR(36))
FROM
	base._DocumentToken
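To make the file format concrete, here is a small Python sketch of both ends of the hand-off.  It is an imitation written for illustration, not the SQL Server or MySQL code itself: format_row builds a line the way the SELECT above does, and parse_line mimics how a STARTING BY '~' prefix combined with comma-terminated fields would be read back.  Both function names are made up.

```python
from typing import List, Optional


def format_row(txt: str, url_txt: str, uid: str) -> str:
    """Build one output line: tilde prefix, commas stripped from the token only."""
    return "~" + txt.replace(",", "") + "," + url_txt + "," + uid


def parse_line(line: str) -> Optional[List[str]]:
    """Mimic the load: skip lines without the '~' prefix, then split on commas."""
    prefix_at = line.find("~")
    if prefix_at == -1:
        return None  # blank lines, headers and the like carry no prefix
    # Everything up to and including the prefix is discarded.
    return line[prefix_at + 1:].split(",")


line = format_row("well,head", "http://example.com/a", "1234")
print(line)
print(parse_line(line))
print(parse_line("(3 rows affected)"))
```

One thing the sketch makes visible: only the token text has its commas removed, so a comma inside urlTxt would split into an extra field on the way back in.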

The net result was the 2.2 million rows loading in less than a minute. More than acceptable, of course. The SDBX upload functionality was then turned loose on the first 1000 rows of this table, and the upload worked perfectly, with the exception of the problem described yesterday.  Performance of the SDBX upload was not all that bad, either.

Code Shavings  The LOAD DATA statement seemed more finicky than it should have been.  I never did get the ENCLOSED BY syntax working so delimiters on text could be used, and the tilde (~) character was kludged in to identify lines with real data on them (as opposed to blank lines, headers and the like).  LOAD DATA doesn’t seem to like Unicode text, either.

Posted on 19th September 2008
Under: Developers' Journal | 4 Comments »

Just a Bit More Site Taxonomy

Previously, the E.intellog.com/data branch of the hierarchy was described in quite a bit of detail.  A little needs to be added with respect to the app branch;

  • E.intellog.com/app/<ApplicationName>  Note in this case, the initial letter is capitalized, which is supposed to connote the idea the application is a proper noun: it’s the name of the product.  An example would be Roundabout.

This leaves open the question of what happens if a version of Roundabout is developed for industries other than the energy industry (which is identified by the E. at the top level of the hierarchy).  There are two possibilities; the application name is kept unique in the entire Intellog universe, or the prefix at the beginning of the hierarchy in combination with the application name uniquely identifies the application.  The Hobson’s Choice, therefore, is whether to brand the Roundabout concept differently for different industries, or to live with the ‘confusion’ of having the unique identity of the application split, and appearing at both ends of the URL.

Posted on 18th September 2008
Under: Developers' Journal | 1 Comment »

More on SimpleDB Explorer

In an earlier post, SimpleDB Explorer (SDBX) was identified as a candidate product to facilitate the uploading of concordance data.  The specs looked good, but all the links to download the product were broken.  That particular problem has now been resolved, and the trial version of SimpleDB Explorer was downloaded and installed from the product website, with no issues.   However, the short version of the story is the product looks a little rough around the edges, and currently has at least one problem which is a potential showstopper.  More on that at the end of this post.

The first question answered was whether the access and secret keys used for S3 would also work with SimpleDB, and they did.  These credentials were supplied to SDBX, and they worked just fine.  With that done, it was possible to see the SimpleDB domain created ‘manually’ with the C# sample code a couple of weeks back.  Queries of the domain contents were possible, using the Amazon-supplied Using Query documentation as a guide.

SDBX only supports upload from MySQL tables, so some time was spent getting familiar with the basics of MySQL, which ultimately resulted in the creation of a DATABASE* called base.  This is roughly analogous to the base schema in the SQL Server version of the Intellog database.  Subsequently, the table base._DocumentToken was created in MySQL, which contains the uid, txt and urlTxt columns.  By convention, the leading underscore indicates denormalized data, as base._DocumentToken in MySQL is populated from the Document-contains-Token tables of SQL Server.  Nothing too sophisticated here; executable SQL statements are generated on the SQL Server side, and the resulting ‘script’ is then run in MySQL.  The net result is a MySQL table which looks very similar to tables found in the SQL Server database, and has the added benefit of being uploadable by SDBX.

Using a brief video as a guide to the use of the upload logic provided in SDBX, a sample table with 1034 tokens contained in a single document was transferred over to MySQL as described above.  The upload logic was invoked, and the dialogue box configured with the source and destination of the data.  The upload seemed to run just fine with one substantial exception; a single item was expected in the SimpleDB Document domain, with 1034 values related to the tokenTxt attribute.  Instead, there were 1034 items, each with a single value related to the tokenTxt attribute.  Here’s the catch; regardless of the fact that the same uid was being supplied for all 1034 rows, it seems like SDBX still differentiates items in the domain by supplying its own unique identifier** for each item.
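The difference between the expected and the observed result is easier to see in code.  The sketch below is a plain-Python illustration with made-up row data and a made-up function name; it says nothing about how SDBX is actually implemented.  It groups rows by uid, which is the behaviour that was expected: one item, with the tokenTxt attribute holding many values.

```python
from collections import defaultdict


def group_by_item(rows):
    """Collapse (uid, attribute, value) rows into one item per uid."""
    items = defaultdict(lambda: defaultdict(set))
    for uid, attribute, value in rows:
        # Rows sharing a uid add values to the same item rather than creating new ones.
        items[uid][attribute].add(value)
    return items


# 1034 made-up token rows, all belonging to the same document.
rows = [("doc-1", "tokenTxt", f"token{i}") for i in range(1034)]

expected = group_by_item(rows)

# What was observed instead: every row became its own item under a generated key.
observed = {generated_key: row for generated_key, row in enumerate(rows)}

print(len(expected), "item expected;", len(observed), "items observed")
```

Whether a single item could actually hold all 1034 values is a separate question, given the limit on name-value pairs per item.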

At first, it was thought it could be the use (or lack thereof) of a standard name for the column containing the uid.  So the upload was re-run first with Item as the name of the column containing the uid, then using a variety of other permutations and combinations, none of which made any difference.  Short of some new information, this is beginning to look like either a bug, or worse, the behaviour intended by the developer of SDBX.

A simplified version of the problem described above was posted to their forum.  If and when received, links to the replies will be posted as a comment to this post.  In addition, there were a few other problems with SDBX which tend to indicate the product is pretty new;

  • There are a fair number of hard-coded spelling errors; for example, on the screen which introduces the upload video, the title reads "…How to upload MySQL data on Amaozn (sic) SimpleDB using SDB Explorer…"
  • There is a dearth of user documentation.  Virtually every help path leads back to the support forum, or to the videos described above.
  • The support forum has virtually no traffic to date.  The question described above was only the third topic.
  • Paste doesn’t work for the attribute grid on the dialogue box which allows the addition of a single item to the domain

And quite a few other little minor things.  However, none of the issues above is an obstacle, so long as the big issue described above can be resolved.

*Or SCHEMA which seems to be synonymous with DATABASE in the MySQL world.

**By inspection, it looks like a 64-bit unsigned integer of the kind returned by the UUID_SHORT() function described in the MySQL documentation, but that’s just a hunch on my part.

Posted on 16th September 2008
Under: Developers' Journal | 3 Comments »