11 million installations can't be wrong

MySQL Journal

Subscribe to MySQL Journal: eMailAlertsEmail Alerts newslettersWeekly Newsletters
Get MySQL Journal: homepageHomepage mobileMobile rssRSS facebookFacebook twitterTwitter linkedinLinkedIn

MySQL Journal Authors: Greg Schulz, Cloud Best Practices Network, Jayaram Krishnaswamy, Elizabeth White, Jnan Dash

Related Topics: XML Magazine

XML: Article

Practical XML with Linux, Part 3: XML database tools for Linux

Hierarchical, relational, and object databases

In previous articles (see Resources for links), I have discussed XML's value as a data format for Linux users. I have also discussed a wide variety of Linux tools for processing XML. But the data that XML usually represents often comes in forms that need to be managed on hard storage. Luckily, XML storage techniques and products have developed as much as other aspects of XML. We lucky Linux users have a wealth of choices available to us as we prepare to manage our XML data persistently.

Just a little theory

Although I used the word "practical" in the title of this article, a little background might be useful in choosing the best tools and techniques for managing your XML databases in Linux.

There are almost as many uses of XML as there are XML users, but there are only two ways of looking at how XML documents are organized. XML's roots lie in SGML, which was originally conceived as a way of structuring documents for machine preparation and human consumption. XML has inherited much of that bias toward documents, and is often used for presentation-oriented publishing (POP). Examples include books, slide presentations, and company Websites. POP formats tend to have elements and text that flow in a flexible and free-form manner.

XML has also gained popularity as the basis for data formats suitable for exchange between computer programs: consumed by machines but able to be inspected by humans. This is known as messaging-oriented middleware (MOM) because of its role in the infrastructure of applications. Examples include serialized objects, automated purchase orders, and Mozilla bookmark files. MOM formats tend to be highly regular, with elements making up well-defined fields with content according to strict data typing.

MOM and POP formats often impose different needs on XML databases, based on the differences in usage patterns and format. We will decide whether certain Linux database technologies are more appropriate for MOM or POP documents.

There are many ways of structuring databases. The relational model, used by well-known DBMSs like PostgreSQL and Oracle, is probably the most popular for new systems, but there are many other approaches. Databases can be:

  • Hash-based systems that store data as key/value pairs. That includes popular Unix database tools like Berkeley DB, DBM, and GDBM.
  • Hierarchical databases, once an especially popular format on mainframe and minicomputer systems, are represented in Linux by Adabas, which uses a very modern form of a hierarchical database.
  • Relational and object/relational databases store data in related tables that represent classes of entities. There are many examples for Linux, such as PostgreSQL, MySQL, Oracle, DB2, Informix, and Sybase.
  • Object databases store data as collections of object instances, using relationships based on object-oriented theory. Linux is poorly represented in this area.
  • Multi-dimensional databases generally model data relationships more richly than relational databases do. M3's Pick database for Linux is one example.
  • Semistructured databases are not as well-known as the above paradigms, and existed mostly in academia before XML's emergence. Semistructured databases support the notion that it is impractical to have a rigid schema for data that models the real world, given the fluidity of the real world. Many of its concepts are a natural fit for XML and related technologies like the Resource Description Framework (RDF).

There is a growing body of work on how to effectively manage XML data in hierarchical, relational, and object databases.

Hierarchical databases are often considered the most natural representation because XML itself is a hierarchy of elements and character data. Tamino, a hierarchical XML data store based on Adabas, offers some of the best performance for XML stored in native database format.

Developers sometimes store XML in relational DBMS by defining a column for each element type. There are other design patterns for that purpose as well. (See Resources for links.) The main problem with most RDBMS approaches is that they are so difficult to extend that they hamper flexibility. One can also just store the XML untouched as character large objects (CLOBs), but in this case, the RDBMS "degenerates" to a simple hashed database.

Object databases tend to store XML in a manner that approximates the document object model (DOM). That creates a persistent implementation of the various nodes that make up an XML document. This approach, as well as the DOM itself, is often considered to have excessive overhead. The eXcelon database is an example of an ODBMS XML store.

For more approaches and examples of XML databases, consult Ronald Bourret's excellent compilation on the subject. (See Resources.)

Toward an XML database standard

Databases must often span time, platforms, and application languages. Because data frequently must outlive code, standards are especially important in this area. As I have previously pointed out in this series, XML is built on standards; recently, there has been an effort to standardize XML databases as well. The XML:DB group is an initiative formed by organizations and individuals interested in XML database standardization.

XML:DB is working on a database connectivity API, a protocol for access control of XML document content, and XUpdate, a language for expressing updates to XML documents. The group has open discussion lists and the specification drafts are freely available. If you are interested in standardized XML database management, take a look at the group's work and encourage your favorite XML DBMS project to consider the specifications as they emerge.

Hacking from scratch

So, how do you begin storing and managing your XML documents persistently?

You can roll your own XML data store from many of the existing Linux tools. You can use DB, DBM, or GDBM, one or more of which should come with your Linux distribution. You could use the file's CRC or some other hash value as the key, or you could extract an element field from the XML document using a quick Perl or Python script.

PostgreSQL and MySQL can be turned into XML data stores using the techniques in the articles in the Resources section, or as simple large-object types.

Or you can just dump the XML files to disk, naming them with a hash or a given field you know will be unique. However, most people won't want to roll their own XML data store when so many specialized tools are available. In the following sections, we will look at some of the available tools for Linux.

Specialized XML data stores: Open source

First, the open source offerings -- you can find links to each product in the Resources section.

dbXML is an XML DBMS for Java that incorporates technologies like Simple Object Access Protocol (SOAP) for messaging middleware, XPath querying, and a rich Java API. dbXML's developers are founders of the XML:DB initiative.

hOpla is a rough-and-ready tool for defining mappings between XML data and PostgreSQL DBMS. An official GNU project, hOpla can set up table structures according to an XML file, and import and export data accordingly.

XML-DBMS was developed by Ronald Bourret, who also maintains the comprehensive list of XML database products that I mentioned earlier. XML-DBMS provides an object-relational mapping between XML and a relational database. The mapping is highly simplified from the DOM -- element types are classes (represented by relational tables), and attributes and character data are object attributes (columns). XML-DBMS provides modules for Java and Perl, but is not designed for standalone use. The main mapping object can be generated from a DTD or XSchema (the new XML schema format proposed by the W3C).

DB2XML maps JDBC ResultSets to DOM objects, and can trigger an XSLT transform for final presentation. It can be used standalone (including a GUI interface) or from other code using a simple API.

DBIx::XML_RDB, developed by Matt Sergeant, who develops many XML tools for Perl, is a Perl module that extracts XML from the results of database interface (DBI) queries. The output can be piped straight to XML::Parser, then to the many other XML modules that feed from XML::Parser.

My company, Fourthought, develops 4Suite, which contains DbDOM, a persistent DOM implementation for Python. DbDOM documents can be stored natively to the filesystem, or to a variety of backends, through the 4ODS object database, including PostgreSQL, Oracle, or even DB and DBM. DbDOM is integrated with other 4Suite technologies, including XPath and XSLT, XPointer, XLink, and RDF.

We also develop 4Suite Server, an XML dataserver that provides integrated services on the core 4Suite tools. That includes document management and XML-based Web services based on HTTP, SOAP, or CORBA, distributed transactions and concurrency, users, and access control. It supports experimental WebDAV support and the draft of the XML:DB standard. The 4Suite integration allows -- among other things -- metadata using RDF, content integration using XLink, and transforms and presentation using XSLT, indexing, and rules.

Specialized XML data stores: Closed source

Rogue Wave Software's XML-DB Link is quite an interesting product. It is basically a network API to databases using SOAP. You design a set of SQL queries and register them with the server, then invoke those queries remotely through SOAP messages and get the result tuples in a simple XML format. XML-DB Link is currently in beta; you can download a 30-day free evaluation.

Since SOAP is considered a leading communications protocol in the emerging field of Web services, XML-DB Link is a good early platform for experimenting with a possible major growth area. Its practicality is questionable, though, given SOAP's recent emergence.

Birdstep is a DBMS with some unique features. First, it is designed to have a very small footprint. Also, Birdstep data can be viewed in relational, object-oriented, and hierarchical modes. It also has a built-in hashing system for stored text, which creates interesting possibilities for indexing and efficiency. Birdstep is very suitable for XML storage; indeed, it comes with tools that allow stored XML to be accessed using DOM or SAX APIs. Birdstep is available for free download.

Lore, a Stanford University project, originated as a semistructured DBMS, but has since been tailored into an XML data store. It has proprietary XML query, indexing, and application-support features. It is geared toward deployments of document-management systems that support a wide variety of interactive searches on evolving data.

GMD-IPSI XQL Engine, formerly PDOM, is a persistent implementation of Java DOM for storing XML documents of arbitrary size in ready form for development, query, and extraction. It supports the draft of the XML Query Language (XQL). Of course, XQL is rather shaky -- the working group has had difficulty making progress toward their deliverable specification -- but it's good to see some experimental adoption. The package is free for noncommercial use.

Software AG's Adabas broke new ground in Linux DBMS. The company has also proven to be a pioneer in XML DBMS with Tamino, an XML data store that manages XML in a hierarchical DBMS for improved efficiency over the relational and object-oriented models. The Linux version is supported under SuSE 7.0 for IBM 390 mainframes. An Intel/Linux version is expected soon. This is a first for me: seeing Linux software come out on the mainframe port before Intel.

OpenLink Software's Virtuoso is an impressive array of XML middleware technologies. It includes an XML data store, either native or through SQL mappings, XPath query, XSLT transforms, and text indexing. It can be accessed using SOAP or WebDAV, and supports Internet protocols like HTTP, SMTP, and NNTP. Virtuoso is a well-integrated device for deploying Web services. A free evaluation download for Linux and other platforms is available.

IBM offerings

If you use IBM DB2 Universal Database Version 7 on Linux (or other platforms), you have the XML Extender, which allows you to generate specially formatted XML documents from the results of SQL queries. It also supports general storage of XML documents and DTDs of any form. XML Extender uses DB2's object-relational features to manage XML in database columns. IBM provides a proprietary mechanism to index the documents based on the values of certain elements. Text-based indexing is available using the DB2 text extender, a related tool.

One neat feature of the XML Extender is that it can extract XML from the query results of any Open Database Connectivity (ODBC)-compliant DBMS. This can be a heterogeneous database integration tool for XML middleware. It includes XSLT and XPath transformation and result query tools for postprocessing after the SQL query. This allows one to use XSLT as a stored procedure language, to some extent.

IBM's XML Extender comes with administration tools similar to those provided for DB2 itself. Of course, it's hard to imagine finding wizards that can truly help manage XML data formats efficiently.

IBM seems to have something for every aspect of XML in its alphaWorks project. DatabaseDom lets you use a sample XML file to define a relational database structure, generate similar XML format from the database, and update or insert a database from XML. The functionality is encapsulated in a JavaBean that processes DOM objects, making it easy to integrate into Java, XML, and database projects.

XML Lightweight Extractor (XLE) is an alphaWorks project that accepts DTDs with special annotations that indicate the mapping of XML content model to relational tables, generating a mapping that can be used to map relational databases to XML instances using JDBC.

Oracle utilities

Oracle's XML SQL Utility (XSU) is a great tool if you already have an Oracle database, and you want to extract MOM data from normalized relational tables and perform table updates in the other direction. XSU maps the result of regular SQL queries into XML documents and inserts data from documents into the appropriate relational columns. This could be used for anything from importing data into other applications and DBMS to creating transaction logs suitable for human inspection. XSU also has an XSLT engine that can automatically apply transforms to the results of XML document generation. That could be the basis for a very flexible database reporting mechanism.

XSU is written in Java and works with Oracle on Linux. It can be used from Java and PL/SQL stored procedures, or directly from Java applets or server-side code. The XML output and input can be in string form or DOM form.

Oracle's utility is a handy tool for those using Oracle and XML in their applications. It's easy to set up (much easier than Oracle itself, to be sure) and is built into Oracle 8.1.7 and the coming 9i. The only problem is that it is strongly tied to the relational model. The elements in the XML documents are generated from table or view column names; similarly, element-type names are used to match columns for insertion. Some hierarchy can be contrived, but overall, the setup is somewhat alien to XML -- it might as well be a simple hierarchy file. The XSLT integration helps, but sometimes requires extra work in XSLT on what should be set up in the source document.

Oracle XSU is a free download if you have registered for the Oracle Technology Network. As usual, you can get the book you need from O'Reilly & Associates. Users of Oracle's XML tools should certainly get Steve Muench's well-written book on the subject. (See Resources.)


As you can see, there is no lack of options. Whether you're a poor college-student hacker or a well-funded organization that has made a sensible decision to develop software on Linux, there are plenty of open source and commercial tools for your XML storage needs. In fact, there are very few XML tools that do not have some kind of Linux port.

You can store documents as native XML, build XML from other forms of data, or even develop hybrid systems with POP data stored as native XML and MOM data extracted from database structure.

But XML is all about standards; be sure to demand standards support from your vendors or to contribute to the standards support of your favorite open source tool. The only downside of having all those choices is the potential difficulty of changing your mind once you've chosen.

More Stories By Uche Ogbuji

Uche Ogbuji is cofounder and CEO of Fourthought, Inc.,
a software vendor and consultancy specializing in XML
solutions. A computer engineer, he has worked with XML for several
years, codeveloping 4Suite, an open-source platform for XML
processing. A frequent conference speaker, Uche has authored a number
of articles on the practical use of XML.

Comments (2)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.