Archive for August, 2008

Software Mirepoix: Apache - PHP - MySQL Installation Notes

With the index data load organized — if not actually done — it was onto the development of a PHP form to provide the user interface into the keyword search capability.  The form will initially be extremely simple — it will allow users to enter search terms which will be captured and passed to SimpleDB through the SimpleDB PHP library.  To get this done, it really was time to get that jaunty software mirepoix of Apache, PHP and MySQL installed, up and running in the development environment.  (See Code Shavings, below for specifics of the packages installed.)  The installation was more-or-less trouble free, with the possible exception of integrating the SimpleDB PHP library with the development environment.  To help others avoid the same mistakes, here are a couple of tips;

  • First things first; ensure PHP error messages are turned on!  Specifically, check the line which looks like display_errors = Off in the php.ini file, and change it to display_errors = On.  Eventually it will be turned off again if and when the server goes production, but for development, it’s essential to be able to see all the error messages being thrown off the code. 
  • Remember to restart Apache after making the change above, or it may not have the immediate effect you expect*.  The version of Apache installed provides an icon in the Windows Task Bar which enables a two-mouse-click Apache restart.
  • Do yourself a favour and write up a little PHP code to display the phpinfo(), such as shown below.  It will provide a nicely formatted summary of the current configuration of PHP.  It was essential for figuring out the way include_path was being set, for example.
	<HTML>
	<?php
	print phpinfo();
	?>
	</HTML>

With respect to the SimpleDB PHP Library itself, the experience has to be filed under ‘be careful what you wish for’.  One of the pitfalls of being in the SimpleDB beta group is it becomes clear the documentation and other support tools are only just coming together.  There is a lot of trial and error, and Googling error messages provides few hints which are useful.  These are early days for SimpleDB and it’s associated PHP Library.  Some hints;

  • The examples provided require a lot of editing before they actually work.  As someone else noted in the forum, there is room for a Hello World application to touch on all the aspects necessary to get an example working.  The code immediately below is what the sample file contains, followed by what it needs to look like in order to actually work;
	// @TODO: set request. Action can be passed as Amazon_SimpleDB_Model_GetAttributes
	// object or array of parameters
	// invokeGetAttributes($service, $request); 

	$request = new Amazon_SimpleDB_Model_GetAttributes();
	$request->setDomainName('Document');
	$request->setItemName('874C863D-FA06-4D2E-8C25-104B317623BD');
	$request->setAttributeName(array('tokenTxt', 'urlTxt'));
	invokeGetAttributes($service, $request);
  • Until such time PHP errors were turned on and it was possible to see exactly what was going on, a ton of time was spent fiddling with the set_include_path in .config.inc.php in the Samples folder, and also with include_path in the php.ini files.  It was assumed lack of error messages meant the code library wasn’t being accessed for some reason.  Once I had errors turned on, it was quickly determined there were no changes required to either statement related to the PHP execution path.
  • The SimpleDB PHP library may be buggy — at least one problem was referred to the forum.  All of the calls to Client.php come back with warning; Undefined index: port in C:\...\Amazon\SimpleDB\Client.php on line 411.
  • Up until now, Notepad and Wordpad were the text editors of choice.  Once again, do yourself download a more fit-for-purpose, quality editor such as Notepad++.  The difference in productivity is like night and day.

It is tempting to say "Apart from that, Mrs. Lincoln, how did you enjoy the show?"  It always seems worse at the time, but truth is, getting the mirepoix set up will take a solid afternoon of futzing around, but then it’s away to the races.  And the price is certainly right.

Go Beavs!Code Shavings  Much thanks to Edward’s Web Developer Site for an overview on setting up the development environment.  ♦  The specific packages and versions were as follows; Apache (apache_2.2.9-win32-x86-no_ssl-r2), PHP (php-5.2.6-win32-installer), MySQL (mysql-5.0.67-win32), MySQL GUI Tools (mysql-gui-tools-5.0-r12-win32) and MySQL Workbench (mysql-workbench-oss-5.0.24-win32).  ♦  Interestingly enough, the MySQL GUI tools download from Canadian mirror sites would not install, but the one from the Open Source Lab at  OSU seemed to work just fine.  Go Beavers.

*If PHP is running as a module — it is, in this case — a restart of Apache is required.  If it’s running as a CGI then the restart is not technically required.  But then again, a restart can’t hurt, either.

Posted on 28th August 2008
Under: Developers' Journal | 2 Comments »

Whither SimpleDB Explorer?

OK, it’s now resolved to always, always, always look for a tool before trying to write one.  The plan was to write a bit of C# to enable the upload of 150 MB of index information into SimpleDB.  Turns out, the same folks who produced BucketExplorer, have a complementary product called — as you might expect — Simple Database Explorer.  Experience with BucketExplorer has been excellent to date, so SDBExplorer seems like a natural.  The only problem seems to be the actual download of said product.  As of this moment, all of the download links are ‘expired’, and there is no way to get at it.  It’s assumed this particular problem will get solved, and therefore, the upload of the index data can be assumed away as well.  The only potential pitfall with the approach may be source database for the upload — the website only talks about being able to load from MySQL, as opposed to raw text files and the like.  Comments will be posted below providing updates on the download problem mentioned above.

Posted on 28th August 2008
Under: Developers' Journal | 2 Comments »

University of Calgary CPSC 594 Project Proposal (S73)


Intellog wrote a draft proposal (S70) in response to the call for projects for Dr. Guenther Ruhe’s Software Engineering  (CPSC 594).  In response to Dr. Ruhe’s comments to S70, a second draft (S73) has been prepared.  The main change is the addition of a new section, on page two, entitled Educational Objectives.  It describes four major areas where students learning will be enhanced through participation in the Metazette project.  As usual, any and all are encouraged to read the linked material, and provide comments, below.

Posted on 27th August 2008
Under: Business Development | 3 Comments »

More on Amazon’s SimpleDB

Well sometimes fortune shines, and the good folks at Amazon saw their way clear to accept Intellog into the beta for SimpleDB.  So despite recent thoughts to the contrary, work continues on the use of SimpleDB to implement keyword search over the ST1/ST49 documents.  The first, preliminary step was to estimate the cost of storing the necessary index information, which was done with the spreadsheet ST1andST49SimpleDBEstimate.xls.  This contains queries on Document-includes-Identity and Document-contains-Token and added the SimpleDB overheads.  It resulted in an estimated size of around 150 MB, which makes SimpleDB very cost effective.  The formula by which usage charges are calculated is a little hard to follow, but explained in sufficient detail in an example provided.

The top level container in the SimpleDB world is known as a ‘domain’.  Conceptually, this lies somewhere between the S3 ‘bucket’, and a traditional database table.   A domain contains items, which are collections of attributes.  Each item is uniquely identified with an item name‘.  The item name, in traditional database parlance, can be thought of as the primary key.  Each item is populated with zero or more ‘attributes’, each of which consists of an ‘attribute name’ and associated value.  There can be as many instances of a given attribute as desired, and they’re tied together using the attribute name.  No structure needs to be declare in advance — everything is created ad hoc, and on the fly.  Having spent a professional lifetime dealing with more structured data, this lack of structure is a little jarring. 

The Intellog objective for naming of SimpleDB objects was to have them tie as closely as possible to the established naming pattern used with the internal database.  The initial inclination was to call the domain used to index the ST1/ST49s simply DocumentIndex, because virtually everything in this domain will be related to indexing.  However, it will likely be expanded in the future to include other attributes about the document.  Therefore, a more appropriate name was simply Document.   The item name (ie. the primary key) will be taken directly from base.Document.uid, in order to tie a document in SimpleDB back to the internal Intellog database.  The tokens and identities associated with the document will be identified with the tokenTxt and identityTxt attribute names.  The URL for the document itself will be stored in urlTxt, providing a cross-reference to the original document stored in S3.

Having worked the the Getting Started Guide and worked out the interim SimpleDB naming standards, getting the basic SimpleDB functionality working was just a matter of working through the C# samples provided.  Within a relatively short period of time, the first ‘web concordance’ was on SimpleDB.  In other words, a SimpleDB item, uniquely identified with an item name, containing one attribute for each associated token or identity, and a cross reference to S3 with one final attribute containing the URL.

Unfortunately, after this initial, quick success, I went down a blind alley and tried to create yet another CLR-based stored procedure — called putSimpleDBConcordance — but it turned out to be more trouble than it’s worth.  It will make more sense to put together a short command-line, or even Windows Form application which will work through each of the documents to be indexed, and execute the logic described above.

Posted on 26th August 2008
Under: Developers' Journal | No Comments »

A First Look at Amazon Web Services’ SimpleDB

The ultimate objective of all the Concordance-related activity is to make an application available which enables keyword searches of ST1s and ST49s — and other documents in the future, of course.  While all of the processing activity is offline (on a laptop, as it turns out), the data eventually needs to wind up on a web-accessible platform.  This approach has already been applied to the storage of the raw ST1 and ST49 documents using Amazon Web Services’ (AWS) Simple Storage Service (S3).  Now it’s going to be the approach taken with the index as well, using AWS’ SimpleDB.

Well, at least that was the plan — it would have been smart to sign up for the latter prior to doing all the other work, because SimpleDB is still on a closed beta!  I’m on the waiting list, but barring some sort of miracle, some other approach to storing and serving the index information will have to implemented, at least for now — maybe for good.  I toyed with the idea of a jump over to Google’s AppEngine, until it appears as though Python is a prerequisite.  Nothing against Python, but it would be starting from scratch on the learning curve.

Although it was hoped it could be avoided for the time being, turns out now is the time to set up an Apache / PHP / MySQL development environment on the local workstation.  There are going to be Intellog applications developed and delivered on this platform in the future, so the setup is time well invested.  It will slow down the delivery of the keyword search of ST1/ST49s, but that’s life in the startup fast lane.

Posted on 25th August 2008
Under: Developers' Journal | 3 Comments »

base.putConcordance and Indexing All ST1 and ST49 Documents

The stored procedure described at the end of the previous post turned out to be the base.putConcordance*, and it’s job is to take an input parameter of @uid_Document, and based on the base.Document.urlTxt identified by the parameter, populate the base.contains, base.Token and base.includes tables.  It does so based on the tokens parsed out of the documents with the new base.Concordance(@urlTxt) user-defined, table-valued function. 

Beyond the basic functionality, putConcordance was created so it’s not sensitive to the order in which the documents are processed, or whether or not the document has been processed previously.  It’s assumed the Concordance.cs class will be tweaked over time, and all that should be necessary to do is to run putConcordance over the document again, and it should be smart enough to know when to add & remove tokens, and do all the other necessary housekeeping.

With apologies for the crappy picture quality.  Click for larger, but still crappy image.A couple of experiments with a single ST1, then a single month’s worth of ST1s, then a single year’s worth all indicated performance of the finished procedure was pretty good, so a script to process all the ST1s was generated.  Nearly eight years worth were processed in one hour, 36 minutes and 39 seconds.  Turns out it was much less time than it took to have my car flatbedded home when the clutch failed on the way to work.

Anywho, after indexing all of the ST1s, there were something around 500,000+ unique tokens, and 1.3 million instances of base.contains.  In other words, 1.3 million unique combinations of base.Document and base.Token.  Yes, the performance of the finished index is quite spectacular.  For example;

SELECT DISTINCT
	Document.localPathTxt
FROM
	base.Document AS Document INNER JOIN
	base.[contains] AS [contains] ON [contains].uid_Document = Document.uid INNER JOIN
	base.Token AS [Token] ON [contains].uid_Token = Token.uid
WHERE
	Token.txt = 'vero'

Produced complete results in well under a second, and this is on a fairly clapped-out Centrino Duo laptop.  Subsequently, an instance of base.Document for each ST49 –  every one from 2001-04-17 though to yesterday — was created.  This was quite easy to do with help from the good ol’ command prompt, where dir /s /a-d /b >temp.txt recursively lists all the files in the directory hierarchy and stores the results in the file temp.txt.  A little more search-n-replace, and there was a script ready to go.  These newly-created instances of base.Document were then used to script calls to base.putConcordance so all of the ST49 documents will be indexed as well.  Those are being processed as of this date and time, and time it takes to process all of the ST49 will be posted as a comment, when results are available.

Next up, some logic to pump the results up to Amazon’s SimpleDB, and an application which will enable keyword searches of the ST1 and ST49 documents stored in Amazon’s S3.

If you have any comments, questions, or suggestions — or if you know a Porsche mechanic who works for food — by all means, please leave a comment below.

Code Shavings  A unique index was put on the txt column of base.Token, but only after shortening the column  NVARCHAR(100) to be consistent with base.Identity.   This of course limits the tokens to 100 characters, but this shouldn’t be a significant encumbrance.  ♦  A unique index on uid_Document and uid_Token was added to base.contains, as was one on uid_Document and uid_Identity to base.includes.

*I have mixed feelings about this name, because it really implies there is a Concordance table, and that instances are being created in it.  But the flip side of the argument is it fits so well with the other objects created in this exercise.

Posted on 22nd August 2008
Under: Developers' Journal | No Comments »

The base.getConcordance() User-Defined Function

The next step in the Concordance implementation was to write the base.getConcordance stored procedure, which takes a URL of a text document as input, and returns an iSentence-compliant XML.  It’s implemented in C# and it doesn’t do much other than to retrieve the text from the URL provided, and sends it to Concordance.cs where it’s parsed and turned into XML.  Once complete, getConcordance simply sends the XML back to the stored procedure.   The active part of the C#  is only seven lines long, as follows;

// create reader & open file
TextReader tr = new StreamReader(urlTxt.ToString());
Concordance cncrd = new Concordance(tr.ReadToEnd());
tr.Close();

// prepare XML and return to calling stored procedure
ASCIIEncoding enc = new ASCIIEncoding();
MemoryStream ms = new MemoryStream(enc.GetBytes(cncrd.xml));
XmlReader xr = XmlReader.Create(ms);
return new SqlXml(xr);

There is probably even more efficiency to be gained, but this code achieved the objective, so further refinement seemed unnecessary at this time.  There were a couple of small gotchas which bogged down the effort, and some known limitations;

  • When deploying the package, it was necessary to select a Permission Level of ‘External’, on the Database tab of the Project Properties.  This indicates the managed code which implements getConcordance is going to access external resources — the file identified with the URL, in this case.
  • Functions written in managed code can only be deployed to the dbo schema.  For a variety of reasons, the base schema was deemed to be a better choice for this application, but there is no way of deploying directly to this schema from the Visual Studio IDE.  The problem was solved, for the time being by simply auto-generating the script which creates the function, changing the schema using search-n-replace, and then running the script again.  It’s even possible to delete the dbo-version of the function, but the IDE complains when the deployment is run again.
  • For some reason not fully understood right now, it’s not possible to have a SQL Server Visual Studio project reference a non-SQL Server project.  An error shows up when you attempt to add the project in.  This is a shame, because Concordance.cs was implemented as part of the Intellog class library.  To get around this for the time being, the code was simply cut-n-pasted into a like-named class within the Dbx project, which contains all the managed-code SQL Server objects created to date.
  • A known limitation is the URL must point to a local path, as opposed to any valid URL, but that will be resolved at some point in the near future.  For the time being, all the files to be parsed are local files.

With all of this complete, all that was left to do was make a slight modification to the base.Concordance table-valued, user-defined function so that it takes the URL as a parameter, and then calls base.getConcordance.  With all of the above complete, the following syntax is permitted;

SELECT
	amt
	,txt
FROM
	base.Concordance('C:\temp.txt') AS Concordance
ORDER BY
	amt DESC

where C:\temp.txt can be any valid path to a text file.  The syntax above returns a table of tokens, and the number of times the token appears in the document, sorted in order from most frequently occurring to least.  It’s been testing with documents of up to 250K, and returns the result table of 3300+ unique tokens in significantly less than a second.

Next up, some relatively simply stored procedure to populate contains-Token and includes-Identity.  By comparison, these will be very simply, and can be written entirely in T-SQL.

Posted on 21st August 2008
Under: Developers' Journal | 1 Comment »

Adding an .xml Property to Concordance.cs

Concordance.cs parses text documents into tokens, and there is now a database structure to store the tokens when they have been parsed.  Now the tokens have to be passed from the C# code to the database.  The method chosen to do this was to serialize the hashtable in Concordance.cs into XML, then deserialize it back into a table on the database side.  This permits the entire table of tokens to be passed as a single parameter containing the XML text.  And who said you couldn’t pass tables as parameters in SQL Server 2005?

The first step in the exercise was to implement an xml property for Concordance.cs.  This turned out to be trickier than first expected.  The first approach was to use simple string concatenation to create an iSentence-compliant XML, based on the somewhat mistaken assumption time could be saved by avoiding the unfamiliar System.XML namespace.  Initially, this tactic worked out, but died as parsing of a large document was tested.  Acceptable performance fell off a cliff — 30-45 seconds to parse a 250K document.  StringBuilder yielded virtually identical results.  The parsing was lightning-fast, and then 30+ seconds were chewed up with the seemingly simple task of assembling the XML text.

For this reason, and given there is no XML serializer built into the Hashtable class, it was decided to use System.XML to create the iSentence-compliant XML.  Besides, the iSentence concept is more-or-less implemented on the database side already.  The only new idea which needed to be added to the database was the ability to handle arrays of values, which is (now) handled by the getEmbeddedIdx(@txt) user-defined database function.  It simply looks for an index between two square brackets (eg. ‘[27]’) tacked onto the name, and extracts an integer value from it.  Everything else is built on top of this basic idea.  So, instead of the iSentence XML containing something like;

<?xml version="1.0"?>
<is:iSentence xmlns:is="http://www.intellog.com/xml/NameSpace/iSentence.xsd">
	<is:amt>
		<is:nm>amt</is:nm>
		<is:var>2</is:var>
	</is:amt>
	<is:txt>
		<is:nm>txt</is:nm>
		<is:var>1aa/05-24-084-07w4/00</is:var>
	</is:txt>
</is:iSentence>

it can now contain a series of values like;

<?xml version="1.0"?>
<IS:iSentence xmlns:IS="http://www.intellog.com/xml/NameSpace/iSentence.xsd">
	<IS:amt>
		<IS:nm>amt[0]</IS:nm>
		<IS:var>2</IS:var>
	</IS:amt>
	<IS:txt>
		<IS:nm>txt[0]</IS:nm>
		<IS:var>1aa/05-24-084-07w4/00</IS:var>
	</IS:txt>
	<IS:amt>
		<IS:nm>amt[1]</IS:nm>
		<IS:var>1</IS:var></IS:amt>
	<IS:txt>
		<IS:nm>txt[1]</IS:nm>
		<IS:var>6-25-91-9</IS:var>
	</IS:txt>
</IS:iSentence>

The code based on System.XML is very straightforward.  It consists of simply using CreateElement to create the nodes, and then adding them to the XML document tree using AppendChild.  The only SNAFU was the requirement to provide the iSentence URI each time CreateElement was used, which just seems awkward to me, or perhaps I’m just missing some as to how to use it properly.  But in any event, at the end of it all, the Concondance.xml property now returns an iSentence consisting of the tokens parsed out of the text passed to it in the Concordance constructor.  Job done.  Turns out the time invested in System.XML was well worth it;  the creation of the XML for the benchmark 250K document was lightning fast — literally, the blink of an eye. 

While working out the creation of the iSentence described above, some attention was paid to how it would be turned back into a table over on the database side.  To that end, the table-valued, user-defined function (TVUDF) base.Concordance(@xml) was created.  It uses the existing base.amt(@xml) and base.txt(@xml) TVUDFs to create two temporary tables representing the two arrays in the XML.  It then JOINs the two tables, and returns the results in one table, and two columns; amt and txt which contain the number of times a token occurs, and the token itself, respectively.

Next is a stored procedure, likely written in managed code, which will take a URL of some sort, pass it to C#, where the text from the URL will be retrieved.  This text, in turn, will be passed to Concordance.cs, which will parse and XML-serialize it.  The managed code will then pass the XML text back to the stored procedure, which will deserialize it with base.Concordance(@xml), and return a table, which can then be used to populate contains-Token and includes-Identity.

Simple.

Code Shavings  Assistance with the creation of the XML text with the System.XML namespace provided by The Bonobo Journal, with thanks.

 

 

 

Posted on 21st August 2008
Under: Developers' Journal | No Comments »

Concordance Database Implementation

The objective of the Concordance class described previously was to parse a text document into a series of tokens — short strings of text.  Truth is, though, these tokens actually fall into two broad categories, and it’s necessary to represent them in two different ways in the database design.  The first category is a plain token, which is truly nothing more than a string of text known to be in the source document.  An instance of the Token table is related to the Document through the contains table, hence Document-contains-Token.  The many-to-many relationship the contains table resolves can be used to link documents that contain the same token.  Pretty useful, but not earth shattering.

The second, and potentially much more interesting category of token is related to the base.Identity table though the includes table.  Some will remember previous discussion of the Identity table, but to summarize, it contains the identity of something through the base.identifies and E.identifies tables.  Of specific interest, for example, will be instances of Identity which happen to be well identifiers of one sort or another.  In other words, in the Document-includes-Identity relationship, you know which documents relate to which wells.  If the reader fully considers and understands the implications of this fact, your head is very likely to explode.  I know mine feels like it’s about to.

The two categories above differ in a couple of other ways.  An instance of Token class will be the result of document indexing logic, whereas an instance of Identity must be the result of some other process or procedure, such as SER.putWell, for example.  Also, deleting instances of Document should cascade through contains-Token.  In fact, if it were possible to code contains-Token as a VIEW, it would better reflect the nature of the beast than thinking of it as a table.  In the case of both includes and contains table, an amt column has been added which will keep track of multiple occurrences of the token in a given document.

Posted on 21st August 2008
Under: Developers' Journal | No Comments »

Concordance.cs

It almost brings a tear to my prairie-dust-filled, bloodshot eye, but I felt the necessity to write a little C# code to create something called the Concordance class.  It’s primary role will be to facilitate the indexing of the ST1s and ST49s discussed in recent posts.  But it’s also a class which could potentially be extended in the future to provide additional insight about what a given document contains.

Initially, it’s only role will be to parse out strings of text found in the document, using a handful of different types of delimiter characters (primarily, the space character, ASCII 32).  It will also eliminate strings which won’t be useful in any sort of index — a single letter, for example.  Finally (for now, at least), it will also count the number of instances of a given string of characters, although I’m not sure it will be used in the keyword search algorithm immediately.   But you can theorize a string of text occurs a lots of times tells you something about the nature of the document, so in it went.

After a couple of different whacks at it, a Hashtable was determined to be the correct construct to deliver the results, given it’s job is typically to manage lists of key-value pairs.  The key, in the case of Concordance, is the string of text parsed out of the document, and the value, will be the number of times the strings occurs.  By definition, this means the value will be less than 1, of course.

Code Shavings  What’s in a name?  According to dictionary.com, one of the definitions of a concordance is "[a]n alphabetical index of all the words in a text or corpus of texts, showing every contextual occurrence of a word"  It’s not exactly right, but pretty close, and certainly a little less generic than ‘index’, say.  ♦  Help with iterating over the hashtable was provided by House of Fusion.

Posted on 20th August 2008
Under: Developers' Journal | 2 Comments »