Return to PHP and SimpleDB

It’s been a very long time (last August, to be precise) since Amazon’s SimpleDB was first mentioned.  At the time, it was being investigated as a potential method of implementing full text search.  A lot of time, effort and money have passed under the bridge since then, not to mention that search functionality was eventually implemented with Solr.  It’s now time to return to SimpleDB, this time for a completely different application: storing the user profile and session information required to provide secured, session-based access to the Intellog website.  After re-reading all the SimpleDB-related blog posts, I now realize there wasn’t a lot of detail on server configuration for PHP/SimpleDB, so that will be fleshed out now.  The most relevant post was Software Mirepoix, and it’s worth taking another look to provide some context for the notes below.  The objective is to establish a more-or-less standard approach to deploying PHP/SimpleDB apps, or at least an approach which can be migrated to the production environment when the time comes.

But first…there was the question of which specific PHP/SimpleDB library to use.  Three options were identified: the PHP Library for Amazon SimpleDB, the php-sdb library by David Meyers, and finally the Zend Framework (ZF).  I couldn’t find a lot of relevant information on the use of ZF with SimpleDB, despite a post to the SimpleDB Discussion Forum.  David Meyers’ library looks great, but its strength (masking the underlying complexity of SimpleDB interactions) was actually seen as a barrier to a clear understanding of the SimpleDB interface, at least for now.  By default, therefore, the Amazon-supplied PHP library was the way to go.

There have been quite a few updates to Amazon’s PHP Library since it was last employed, so the latest version (amazon-simpledb-2009-04-15-php5-library.zip) was downloaded, and unzipped into a folder of the same name (minus the .zip, of course).  Within this folder, there was another called src, and the ReadMe.html.  Within src, in turn, there was a folder called Amazon.  The latter folder was the one copied to C:\Program Files\PHP\include, and the include_path line in php.ini was modified by appending C:\Program Files\PHP\include to its existing definition.

.config.inc.php was put in C:\Program Files\PHP\include\Amazon\SimpleDB.  This is the same place as the library class files Client.php, Model.php etc., rather than in the folder(s) where the Roundabout application is located.  Because .config.inc.php contains the Amazon Web Services (AWS) access key and the secret identifier, it was felt it was better to keep it out of the web-accessible folder hierarchy under DocumentRoot.  Incidentally, .config.inc.php also contains the __autoload function which, according to the inline documentation, "is responsible for loading classes of the library on demand".  It’s not 100% clear what this means, but the net effect of the function is to make all the classes in the library available to the application code.
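As a rough illustration of what loading classes "on demand" can look like, here is a minimal sketch of an autoloader.  It is not the code shipped in .config.inc.php: the helper name classNameToPath is hypothetical, and it uses spl_autoload_register rather than __autoload.  It assumes the library's convention that a class such as Amazon_SimpleDB_Client lives in Amazon/SimpleDB/Client.php somewhere on the include_path.

```php
<?php
// Hypothetical helper: turn a class name such as Amazon_SimpleDB_Client
// into the relative path Amazon/SimpleDB/Client.php.
function classNameToPath(string $className): string
{
    return str_replace('_', '/', $className) . '.php';
}

// Register a callback that PHP invokes the first time an unknown class is used.
spl_autoload_register(function (string $className): void {
    $relativePath = classNameToPath($className);
    // stream_resolve_include_path() searches the include_path directories
    // and returns false when the file cannot be found.
    $fullPath = stream_resolve_include_path($relativePath);
    if ($fullPath !== false) {
        require_once $fullPath;
    }
});
```

With something like this in place, application code can simply write new Amazon_SimpleDB_Client(...) without a separate include for every class file.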

To provide an initial test of the configuration, one of the samples provided as part of the library, ListDomainsSample.php, was copied over to the app.E.intellog.com/var/www/html folder.  Just one line needed to be changed: include_once('.config.inc.php') needed to be modified to read include_once('Amazon/SimpleDB/.config.inc.php').  Keep in mind, the include_path in php.ini, modified above, tells PHP which directories to search for a file included by relative path, so there is no need to be more explicit with the include_once statement.
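For environments where php.ini can't be edited, the same effect can be approximated at run time.  The sketch below is only an illustration; the helper name appendIncludePath is made up, and it simply appends a directory to the include_path if it isn't already listed.

```php
<?php
// Hypothetical helper: append a directory to include_path at run time,
// mirroring the edit made to php.ini, unless it is already listed.
function appendIncludePath(string $directory): string
{
    $current = explode(PATH_SEPARATOR, get_include_path());
    if (!in_array($directory, $current, true)) {
        $current[] = $directory;
        set_include_path(implode(PATH_SEPARATOR, $current));
    }
    return get_include_path();
}
```

On the development machine described here, the call would be appendIncludePath('C:\Program Files\PHP\include'), after which include_once('Amazon/SimpleDB/.config.inc.php') resolves against that folder.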

Oh yes, and it’s also necessary to define the $request variable in ListDomainsSample.php, but that’s a one-liner, as per $request = new Amazon_SimpleDB_Model_ListDomainsRequest();  With that done, the code lit right up and was able to produce a listing of domains associated with the SimpleDB account.  If you have any questions or comments, please do not hesitate to contribute them below, and thanks for reading!

Code Shavings  Executing the sample application above initially resulted in an error message which, amongst other things, said "[u]nable to find the socket transport ‘ssl’ - did you forget to enable it when you configured PHP?"  Some Googling revealed this is due to the lack of the OpenSSL extension in the PHP configuration.  This problem was addressed by upgrading PHP in the development environment from the original, installed using php-5.2.6-win32-installer.msi, to a slightly more up-to-date version, installed with php-5.2.9-2-win32-installer.msi available from php.net.  The real trick, though, was to make sure OpenSSL was one of the extensions selected at the relevant step in the installation.  Somehow, I missed that the first time around.  The installation script even configures php.ini so it knows about the OpenSSL extension.  ♦  To this point, I wasn’t clear on precisely the way included files worked in PHP.  Turns out the file Amazon\SimpleDB\Client.php (for example) contains a single class called Amazon_SimpleDB_Client.  Note the name of the class mirrors the directory structure, except the slashes have been replaced with underscore characters.  This pattern appears to be adopted for all files in the Amazon library.  ♦  As with OpenSSL, the installation script for PHP is smart enough to know to add extension=php_xsl.dll to the php.ini file.

Posted on 26th May 2009
Under: Developers' Journal | No Comments »

Collections of People in the Intellog Database

Collections of people are represented in the Intellog database by resolving a many-to-many relationship between base.Person and base.Team using the base.populates table.  This simple, three-table structure is illustrated in the diagram to the immediate left.  It allows a given person to be a member of an unlimited number of teams, and for a given team to have an unlimited number of members.  All three of these tables already existed in the Intellog database, but a little cleanup was required.  In particular, the xml attribute in Person and Team had to be upgraded from the iSentence schema collection to iSentenceV1.  A migration of this type is still surprisingly awkward, unless I’m fundamentally missing something about casting between XMLs belonging to different schema collections.  It involved creating a second XML attribute validated with the iSentenceV1 schema collection, migrating the data from the old attribute to the new one, dropping the old one and finally renaming the new attribute to the same name as the old attribute.  Whew.

In addition to the cleanup exercise, it was considered necessary to create a base.putTeam stored procedure which is used to populate the base.Team table.  It implements a few assumptions: in addition to a verbose name, each team is identified with a unique four-character label*.  It just seemed to be a handy middle ground between the full uid and the verbose name of the team.  When base.putTeam is asked to create a new instance of base.Team, it queries to see if the lbl exists, and if it does, skips the creation of the new instance and displays the existing instance.  Surprisingly, neither lbl nor nm occupies its own attribute yet in base.Team; both are found embedded in the xml attribute, an iSentence.  This may seem a little quirky to some.  However, while the database is in an evolutionary phase, it doesn’t make sense to be adding and deleting attributes (i.e. columns) all the time.  Also, base.putPerson has not been usable with base.Person for quite some time, so it was fixed up to make adding instances to base.Person a little easier.

With this done, Person and Team were cleaned up and their population completed to support the testing of the Roundabout application, the development of which is ramping up again.  To make it easy to verify the accuracy of the population of all tables, the base.TeamMember VIEW was created which INNER JOINs base.populates to base.Person and base.Team and displays the verbose names of both.
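The real view is written in SQL against the tables named above, but the idea behind the join can be sketched in a few lines of PHP using in-memory arrays.  Everything in this sketch (the keys, names and the teamMembers function) is invented for illustration and does not reflect the actual column definitions.

```php
<?php
// Illustrative in-memory stand-ins for base.Person, base.Team and base.populates.
// All identifiers and names below are made up.
$persons = [1 => 'Alice Example', 2 => 'Bob Example'];
$teams = [10 => 'Example Drilling Ltd.', 20 => 'Sample Energy Inc.'];
$populates = [
    ['personId' => 1, 'teamId' => 10],
    ['personId' => 1, 'teamId' => 20],
    ['personId' => 2, 'teamId' => 10],
];

// Equivalent of an INNER JOIN: keep only rows whose person and team both exist.
function teamMembers(array $persons, array $teams, array $populates): array
{
    $rows = [];
    foreach ($populates as $link) {
        if (isset($persons[$link['personId']], $teams[$link['teamId']])) {
            $rows[] = [
                'personNm' => $persons[$link['personId']],
                'teamNm' => $teams[$link['teamId']],
            ];
        }
    }
    return $rows;
}

foreach (teamMembers($persons, $teams, $populates) as $row) {
    echo $row['teamNm'], ': ', $row['personNm'], PHP_EOL;
}
```

Each row of the linking array pairs one person with one team, which is what lets a person belong to many teams and a team hold many people.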

*This started life as the ERCB’s ST104A code, but was extended to cover off companies not in the ERCB database.

Posted on 15th May 2009
Under: Developers' Journal | No Comments »

The Basic Login Experience

The previous post described some of the objectives of Intellog customer* authentication, and this post provides some guidelines for the design and implementation of the login process.  This is a tad more than simply identifying a screen layout (see mock up, left), and describing functionality.  It also tries to capture some of the other aspects of the ‘experience’ which will have an impact on the customer’s interaction with Intellog.  Everything below assumes the customer profile has already been set up.  Also, the process for changing the customer profile is beyond the scope of this post.

The single most visible artifact of the login process is the string of characters (that is, the user ID) used to identify a given customer to the site.  Because it is top-of-mind, the customer’s email address is the user ID on Intellog.  Once the customer enters their email address and clicks the Login button at the bottom right, the email address is used as a key to retrieve the customer’s profile information.  If no profile can be retrieved based on this key, then it’s assumed the email address was either typed incorrectly, or no profile exists for that customer.  The login screen is recycled with some error messages displayed.  No information from the profile is displayed to the user until the round trip to the identity provider** (IP), described immediately below, has been completed.

Amongst a variety of other information, the customer’s profile contains the IP to which they normally authenticate.  Which specific IP they are using will determine what appears next.  If they are authenticating to Google, for example, the Google-specific login pages will appear.  If they are authenticating to an Intellog-provided OpenID, Intellog-specific login pages will appear.  Noticeable by its absence is a field in which to enter a password.  Supplying a password, or other authentication information, is delegated to the login page(s) from the IP.  This also leaves the IP with sole responsibility for handling a ‘hot’ combination of user ID and password and/or other authentication information.
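The decision made when the Login button is clicked can be sketched as follows.  This is a design illustration only: the profile store is a plain array, the field name identityProvider and the function nextLoginStep are hypothetical, and normalising the email address to lower case is an assumption not discussed above.

```php
<?php
// Hypothetical profile store keyed by email address; the real profiles would
// live in a database, and the field names here are assumptions.
$profiles = [
    'pat@example.com' => ['identityProvider' => 'google'],
    'sam@example.com' => ['identityProvider' => 'intellog-openid'],
];

// Decide what happens after the customer clicks Login.
function nextLoginStep(array $profiles, string $email): array
{
    $key = strtolower(trim($email));
    if (!isset($profiles[$key])) {
        // Unknown address: recycle the login screen with an error message.
        return ['action' => 'showLogin', 'error' => 'No profile found for that email address.'];
    }
    // Known address: hand off to the identity provider; no password is collected here.
    return ['action' => 'redirectToProvider', 'provider' => $profiles[$key]['identityProvider']];
}
```

Note that the function never sees a password; whatever credentials are required are collected by the provider's own pages.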

The Remember Me checkbox really isn’t doing much on this screen — if it has been checked, it’s assumed the value from the Email Address field entered previously (and stored in a cookie, most likely) is automatically used to retrieve the customer profile.  This is followed by automatic navigation to the IP associated with the email address.  If it’s the same customer using the client machine as last time, they will be able to provide credentials to the IP and be authenticated.  If it’s a different customer using the machine, they obviously won’t have a clue what credentials are required, and they’re dead in the water. 

Unless…unless…the IP also has its own Remember Me or equivalent facility, which allows credentials to be cached on the local machine, and supplied automatically.  If the customer elects to use the IP’s Remember Me facility, they have tied the security of their account to the physical security of their client machine.  That’s their choice, of course, but one which would be hard to recommend to anyone in their right mind.

It’s assumed the customer authenticates to their IP, and they have elected to send some of their IP-stored profile information back to an Intellog page.  The profile information returned by the IP is compared to what had previously been stored in the Intellog customer profile.  If there are any differences, the information from the IP is considered authoritative and overwrites the Intellog customer profile information.  This is aligned with a philosophy that  in the future, in a galaxy far, far away, security credentials will be managed in exactly one place, and they will propagate out to whichever services a given person uses.  Change your security credentials in this fabled ‘one place’, and they will automagically appear everywhere.
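A minimal sketch of the "IP is authoritative" rule, assuming both profiles are available as associative arrays.  The function names and field names are hypothetical; fields the IP did not return are left untouched.

```php
<?php
// Overwrite stored profile fields with whatever the identity provider returned.
// Field names are hypothetical; fields the provider did not send are left alone.
function syncProfile(array $storedProfile, array $providerProfile): array
{
    return array_merge($storedProfile, $providerProfile);
}

// Report which fields would change, e.g. for logging.
function changedFields(array $storedProfile, array $providerProfile): array
{
    $changed = [];
    foreach ($providerProfile as $field => $value) {
        if (!array_key_exists($field, $storedProfile) || $storedProfile[$field] !== $value) {
            $changed[] = $field;
        }
    }
    return $changed;
}
```

Because array_merge lets later string keys win, the provider's values replace the stored ones wherever the two disagree.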

With all that done, the customer will come back to the first application page available to logged-in users.  The most telling evidence of this fact will be the appearance of the user’s first name, last name and company affiliation in the top right of the standard Intellog header bar, and a Logout button on the footer bar.

*User had a very pejorative ring to it, so it is henceforth expunged in favour of the much jauntier sounding customer.

**See previous post for full discussion of the concept of identity provider.

Posted on 12th May 2009
Under: Developers' Journal | No Comments »

Intellog User Authentication Manifesto

After a round of marketing meetings related to the Onramp beta release, attention returns to development issues.  Next up, finalizing the details for the user authentication process previously discussed in the blog posts Roundabout OpenID Login Workflow and Implementing OpenID with PHP.  Much of what is contained below reflects some rapidly evolving trends in identity management, coupled with a more developed understanding of how users will interact with Intellog.

As much as I would like to say it’s going to be an OpenID world (some day, I hope it is), there is clearly going to be a lot of competition for the user identity parapet.  The likes of Google, Microsoft, Yahoo, Twitter, AOL and others all see their native user identification as the centre of their customers’ universe.  They won’t encourage user migration to an open standard they do not control, because there is just too much potential value in knowing what other sites their customers are visiting and using.  In addition, the sheer number* of users with identities on these major sites means OpenID will have to co-exist with these other de facto identity providers for quite some time to come.  Therefore, the Intellog authentication system needs to reflect not only OpenID, but the identities provided by the other major sites.

It’s logical to assume, therefore, that users coming to Intellog will already have an online identity (of which OpenID is just one possibility) and may want to use it, instead of physically re-entering all their user information for an Intellog-issued identity and profile.  To the greatest possible degree, therefore, the details of an existing online identity should be transferable to the user’s Intellog identity and profile, subject to the user’s approval, of course.

At the same time, there will be users who have an online identity as noted above, but don’t care to use it anywhere other than the site which issued it to them.  It’s also possible a very small percentage of potential users will have no online identity at all, or at least none of which they are aware.  In either of these cases, it would be desirable for Intellog to issue a branded user identity of its own.  This is preferable to sending them off to a third party to get an identity and potentially losing them on the return trip to Intellog.  Identities issued by Intellog should be OpenIDs, to remain aligned with the trend to separate identity management concerns from content provision, which is likely to be of increasing importance in the future.

Subsequently, when an Intellog user logs in, they will identify their particular identity provider (IP), and will then work through the authentication process required by that particular IP.  Assuming they successfully authenticate themselves, they would then automatically come back to Intellog, with some sort of identifier which could be used to retrieve their Intellog profile.  Following this, an Intellog session will be initiated to maintain their ‘logged in’ status, and other session-specific information.  This session will exist until such time as they completely close their browser.

In summary, and to use OpenID parlance, Intellog must have the ability to act as a relying party for OpenID and all the other de facto identity providers.  It would also be desirable for Intellog to be an OpenID identity provider, which would enable potential users to create an OpenID which would then be used to log into the Intellog site.  In this latter case, it will be important to have the creation of the OpenID and subsequent login as ‘branded’ an experience as possible.  This is because there may be some users unfamiliar with OpenID and shared identity concepts, who may be put off by seemingly having their personal information shared between multiple parties.

Next up, some specific technologies which would appear to address the requirements outlined above.

*A quote attributed to Scott McNealy (Sun) was "a standard is anything shipping in numbers".

Posted on 11th May 2009
Under: Developers' Journal | 1 Comment »

Indexing of Saskatchewan Well Bulletins

With the marketing effort well underway, it was time to pay some attention to building up the body of documentation available for the Onramp search engine.  Next on the list were the Well Bulletins, Saskatchewan’s equivalent of the ST1.  The first job was to download a copy of each of the reports and upload it to Intellog’s S3 infrastructure.  All reports were uploaded, right back to 2003-01-02, the first day they were made available.  (Note: Well Bulletins are not published on weekends or holidays.)

But of course, the real objective was to make the Well Bulletins searchable with Onramp.  To achieve this, the now more-or-less standard set of steps for the indexing of text files was followed;

  1. A copy of the files to be indexed was captured in a parallel directory hierarchy on a local workstation, and then the list of files was generated into a file called manifest.txt.  This was accomplished with the DOS (remember that?) command dir /s /b /on /a-d >manifest.txt.
  2. manifest.txt was imported into a temporary SQL Server table called dbo.manifest$, which was subsequently used to populate localPathTxt and urlTxt of base.Document.  Inserting instances into base.Document automatically assigns the globally unique identifiers, which makes the name of these documents unique in the known universe.
  3. The Excel spreadsheet fileSizeHack.xls was used to establish the size of each Well Bulletin file, and then this data was imported into the temporary table dbo.Filtered$.
  4. The data from dbo.manifest$ and dbo.Filtered$ was then combined, and the iSentence-compliant column in base.Document.xml was populated with sizeAmt, sourceLbl, typeLbl and udt.
  5. SQL was then used to generate wrapBatch.sh, which is a series of calls to a second shell script called wrap.sh.  The latter takes the parameters passed to it in each line of wrapBatch.sh, and generates a Solr-compliant XML file.  Each XML is stored in a separate file named using the globally unique identifier created when the reference to the document was inserted into base.Document, followed by the .xml extension.
  6. SQL was also used to generate index.sh, which issues a cURL statement to feed the wrap.sh-generated XML files to Solr, and then issues a commit statement when all else is done.
  7. These two files were then uploaded to the server.  wrapBatch.sh was executed, and the XML files generated.  These XML files were then downloaded back to the local workstation.  (For those who think this sounds a little bass ackwards, it’s simply because wrap.sh contains command syntax native to Linux, so it doesn’t currently work on Windows.)
  8. index.sh was executed on the local workstation (using the temporary name index.bat), and the local version of Solr was populated.  A few modifications to wrap.sh were required to remove some control characters which crop up in the Well Bulletin files.  These modifications necessitated a couple of iterations of this step and the previous one.
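The wrapping in step 5 and the control-character cleanup in step 8 can be sketched in PHP as follows.  This is not wrap.sh, which is a Linux shell script whose contents aren't shown here; the function name wrapForSolr and the field names id and text are assumptions, and only the outer add/doc/field structure is Solr's standard XML update format.

```php
<?php
// Build a Solr <add> document for one text file.
// The field names (id, text) are assumptions; the real schema is not shown in the post.
function wrapForSolr(string $documentId, string $rawText): string
{
    // Drop control characters that are illegal in XML 1.0, keeping tab, LF and CR.
    $cleanText = (string) preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $rawText);

    // Escape &, <, > and quotes; substitute invalid byte sequences rather than fail.
    $escape = function (string $value): string {
        return htmlspecialchars($value, ENT_XML1 | ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    };

    return '<add><doc>'
        . '<field name="id">' . $escape($documentId) . '</field>'
        . '<field name="text">' . $escape($cleanText) . '</field>'
        . '</doc></add>';
}
```

A caller would write the returned string to a file named after the document's unique identifier, with an .xml extension, matching the naming described in step 5.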

Once the test indexing was complete, a few modifications to /Onramp/xml/ApplicationDefinition.xml, /Onramp/xsl/outputSearchResult.xsl and /Onramp/php/Onramp.php were required to accommodate the new file type and its source.  Later in the evening, index.sh was executed on the production server, and the modified files were uploaded, which completed the process.  The only gotcha was Java running out of heap space, which was cured (?) by using java -Xmx512M -Xms512M -jar start.jar to increase the initial and maximum heap space available.  It may have to be increased still further in the future.

Code Shavings  Both the raw text (TXT) and the nicely delimited equivalent (CSV) files were downloaded from the SER website, and the original intention was to index both.  But the excerpt of the CSV files displayed on the results screen looked really rough.  So it was decided to drop the CSV from the index, and to eventually make the CSV file available as a separate link, immediately adjacent to the TXT, on the results page.  ♦  The SQL used to accomplish the steps above was captured in the sqlTtxt.txt files found in the working folders saskatchewan/2009/04/24 and 27.  ♦  Thanks to Thierry Collogne and Caucho for their assistance in resolving the Java heap space problem.

Posted on 28th April 2009
Under: Developers' Journal | No Comments »

Onramp Added to Amazon Web Services Solutions Catalogue

Applications built using Amazon Web Services (AWS), like Intellog’s Onramp, have the opportunity to be listed in the AWS Solutions Catalogue.  With the recent release of the beta, a description of Onramp was submitted to AWS, and was published in the catalogue shortly thereafter.  Click here to take a look, and please feel free to provide comments or feedback below.

Posted on 17th April 2009
Under: Business Development | No Comments »

242 Reasons Why Intellog’s Onramp is a More Efficient Search Engine

A user recently asked us to compare Intellog’s Onramp search engine with other well-known search methods.  For example, they wanted to find information on "multiwell proration", for which they would normally use a general-purpose search engine such as Google®.  They wanted to know how these search results would compare to results from the source site’s embedded Search box and, in turn, how they would compare to results from Intellog’s Onramp search.  Here are the results of our investigation:

Google® Search  If you click the thumbnail to the left, you will see Google®’s results are surprisingly good, featuring content from Alberta’s Petroleum Registry and the ERCB.  But it also includes some information from the Texas Railroad Commission* website, which is not surprising, because there were no geographic criteria specified in the search.  In other words, Google® is providing a good, high-level overview of information available throughout the known universe.  But clearly, some refinement of the search criteria is going to be necessary to actually get to the page or two of information of specific interest.

ERCB Search  Let’s assume the knowledge gained from the Google® search above is used to determine that the ERCB website is the one which is of interest.  You go to that site, and input the "multiwell proration" search criteria into the Search box at the upper right.  If you click the thumbnail, you will see the results are much closer to the intended target.  In fact, the specific document containing the information, Directive 17, is just the fourth hit down.  This is great, with just one fairly significant drawback.  When you click on the link, it downloads a 243-page PDF.  It’s now up to you to go through that document to find your information.

Onramp Search  Finally, let’s assume you use Intellog’s Onramp search engine, and you input the same criteria as you did with the previous two methods.  The first hit is page 141 of Directive 17.  Not only has Onramp found the document for which you are looking, but it has also used PageIndex to rank the pages within that document in order of their relevance.  With a little luck, the specific information for which you are looking will be contained in the first page you read.  Which means, of course, there were 242 pages you didn’t have to read…242 reasons why Onramp is a more efficient search!

It’s important to note that this comparison is in no way intended to disparage the first two search methodologies, but rather to highlight Onramp’s specific strengths; content which is tailored to the energy industry it is intended to serve, and secondly, the use of PageIndex to provide indexing of documents right down to the individual page level. 

Please feel free to provide comments below, or if you have your own case studies, don’t hesitate to provide some details so we can publish them in the future.

*The TRRC has responsibility for oil & gas development in the state of Texas.

Posted on 31st March 2009
Under: Business Development | No Comments »

Onramp — Intellog’s Next Generation Search Engine

Onramp is Intellog’s next generation search engine — like Google® and Yahoo®, Onramp makes finding information easy; enter a few search terms to identify the target of your search and the information meeting your criteria is displayed. But Onramp improves on the state-of-the-art in two important ways; it tailors the tools and content to specific industries, and Onramp is capable of indexing information right down to the individual page.

Initially focused on the energy industry, and in particular, petroleum production in the Western Sedimentary Basin, Intellog has indexed the full text of every ERCB Directive so you can find the specific page of interest, without having to read unrelated material.   In addition, Intellog has indexed the full text of the ERCB’s Well Licenses Issued (ST1), Drilling Activity (ST49), and Pipeline Approval & Disposition Daily List (ST96) reports for the past eight years — nearly 9000 documents and growing daily. And new content is being added all the time.

Onramp can also be seamlessly integrated into your corporate website, enabling its powerful search engine to enhance your site’s brand and capabilities. Your own documents can be indexed with Onramp, and set up for secure inhouse or public access.  Learn more…

Ready to hit the Onramp? Click here, or if you need more information or have any questions, please contact us.

Posted on 26th March 2009
Under: Business Development | No Comments »

Oh What a Tangled (CSS) Web We Weave…

Working through the final round of testing the new beta release of Onramp on no fewer than five browsers (Safari, Firefox, Internet Explorer, Chrome and Opera), I came up against what appears to be an intractable saw-off between IE and Opera.  It relates to the way the content panel* is being controlled with a related CSS file.  Without going into all the gory details (there’ll be more than enough time for that later), I’ve found the multiple CSS files I have been using to this point were persistently getting tangled up.  A change in one precipitates unexpected consequences elsewhere, and similar difficulties.

The primary cause of this problem was the original vision of a series of ‘snippet’ CSS files, each containing just the CSS relevant to a particular object on the page.  It seemed logical at the time: the directory listing could be used to home in on a particular block of CSS.  However, the perceived advantage of this approach was outweighed by the negatives.  As a result, these files have been steadily merged back together, and it’s now time to merge the last two remaining CSS files, Intellog.css and layout.css.  Henceforth, the pattern of CSS usage will be as follows;

[siteRoot]/css/Intellog.css  Contains all of the CSS which is shared by the entire suite of Intellog applications, whereas…

[siteRoot]/[appRoot]/css/[appRoot].css  contains the CSS which is relevant to a particular application. 

Using the Onramp application, for example, the formatting for the result table produced by outputSearchResult.php can be found in /Onramp/css/Onramp.css.  Pretty much everything else is in /css/Intellog.css, because it is common to all applications.

It’s anticipated other, third-party CSS files will be added in the future, and these should be placed in whichever directory is most relevant.  CSS relating to the Onramp application would go into /Onramp/css, whereas CSS relating to Roundabout would go into /Roundabout/css.  CSS relating to both applications would go into /css.
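The naming pattern lends itself to a small helper that works out which stylesheets a page needs.  The sketch below is hypothetical (the function names are invented, and [siteRoot] is assumed to be the web root); loading the shared file first, so application rules can override it, is also an assumption rather than something stated above.

```php
<?php
// Return the stylesheets for an application, shared file first so that
// application-specific rules can override it (an assumed load order).
function stylesheetsFor(string $appRoot): array
{
    return [
        '/css/Intellog.css',
        '/' . $appRoot . '/css/' . $appRoot . '.css',
    ];
}

// Emit one <link> tag per stylesheet.
function stylesheetTags(string $appRoot): string
{
    $tags = '';
    foreach (stylesheetsFor($appRoot) as $href) {
        $tags .= '<link rel="stylesheet" type="text/css" href="'
            . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '" />' . "\n";
    }
    return $tags;
}
```

Calling stylesheetsFor('Onramp') yields /css/Intellog.css followed by /Onramp/css/Onramp.css, matching the two locations described above.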

*This is the area of the user interface lying between the header and footer bar.

Posted on 19th March 2009
Under: Developers' Journal | No Comments »

The Use of lnk Element in ApplicationConfiguration.xml

ApplicationConfiguration.xml has been described previously as the repository of all information related to the configuration of the Intellog base application, as well as the configuration of Onramp and Roundabout.  Within those three files, lnk is the standardized element used to represent a logical link to other content.  It consists of one or more of the following elements (presented in alphabetical order).

  • dsc  A brief description of the lnk, for internal, technical purposes.  The content of this element should not normally be exposed to the end user, as it likely contains technical jargon.
  • helpXhtml  The help text which is associated with the lnk.  This element should contain valid XHTML as you would expect to find in between the <BODY> tags of a complete document.
  • imageUrlTxt The URL of the image to be displayed instead of the lbl or screenLbl.
  • javascriptTxt  The text of the JavaScript which will be executed when the lnk is clicked.  This element is mutually exclusive with txt, described below.
  • lbl  The unique identification of the lnk.  This is a mandatory field, and will also serve as the onscreen representation of the lnk, in the absence of screenLbl, described below.  Camel case is recommended, and spaces and other punctuation are to be avoided.
  • screenLbl  What is displayed on the screen to represent the lnk.  If this field is not populated, then lbl is used in its place.  Spaces and punctuation are acceptable.
  • seq  The sequence of the <lnk> element when it appears within a collection of similar lnk elements, such as would be found in the breadCrumbWkflw.
  • tipTxt  The text tip which is automatically displayed when the user hovers the mouse over the lnk.
  • txt  The text of the hyperlink — in other words, what ends up associated with the href attribute for the finished link.  This element is mutually exclusive with javascriptTxt, described above.
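To make the rules in the list concrete, here is a minimal sketch of how a lnk element might be turned into an HTML anchor with SimpleXML.  The sample XML and the renderLnk function are invented for illustration; only the child element names come from the list, and the real rendering code may differ.

```php
<?php
// Render one <lnk> element as an HTML anchor. The sample XML is invented;
// only the child element names come from the list above.
function renderLnk(SimpleXMLElement $lnk): string
{
    $lbl = trim((string) $lnk->lbl);
    if ($lbl === '') {
        throw new InvalidArgumentException('lbl is mandatory');
    }
    $txt = trim((string) $lnk->txt);
    $javascriptTxt = trim((string) $lnk->javascriptTxt);
    if ($txt !== '' && $javascriptTxt !== '') {
        throw new InvalidArgumentException('txt and javascriptTxt are mutually exclusive');
    }

    // screenLbl wins when present; otherwise fall back to lbl.
    $screenLbl = trim((string) $lnk->screenLbl);
    $caption = $screenLbl !== '' ? $screenLbl : $lbl;

    $esc = function (string $value): string {
        return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
    };
    $attributes = $javascriptTxt !== ''
        ? ' href="#" onclick="' . $esc($javascriptTxt) . '"'
        : ' href="' . $esc($txt) . '"';
    $tipTxt = trim((string) $lnk->tipTxt);
    if ($tipTxt !== '') {
        $attributes .= ' title="' . $esc($tipTxt) . '"';
    }
    return '<a' . $attributes . '>' . $esc($caption) . '</a>';
}

$lnk = simplexml_load_string(
    '<lnk><lbl>contactUs</lbl><screenLbl>Contact Us</screenLbl>'
    . '<tipTxt>Send us a note</tipTxt><txt>/contact.php</txt></lnk>'
);
echo renderLnk($lnk), PHP_EOL;
```

The sketch ignores dsc, helpXhtml, imageUrlTxt and seq to stay short; a fuller version would need to handle those as well.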

Actually, it is more accurate to say lnk will be the standardized element in ApplicationDefinition.xml used to represent a link to other content.  In other words, this post documents the desired future state of this element.  There are currently a variety of inconsistencies in its use, as well as considerable overlap with the btn element.  This post serves as a reference which will govern the revision of existing code so it complies with the standard, as well as a guide for new code development.

Posted on 12th March 2009
Under: Developers' Journal | No Comments »