XML XPLained:
The mystery of the missing 3rd file

'XML' has a reputation -- sometimes deserved -- for being arcane, abstruse and almost a programming language in its complexity. It can be, but that's a bit of bad press. I think it's easy enough to explain how it is 'more than HTML', how it differs from a 'Word' or word-processing exercise, and why that difference makes it very powerful for large organizations and the Web. Put as simply as possible, XML (and its 'parent', SGML, before it) uses a third reference file when any given document is presented for display in a browser or in print. This is called the (dreaded!) DTD, the Document Type Definition, and it works in this way.

Word-processing documents consist of two files. The first is the 'content' file, the information you enter: the 'text' of the message. The second file, seldom seen but always there, is the internal information which describes how that content will be displayed. It is often called a 'style sheet' because it defines what the 'content' will look like. In the absence of such a file, the 'content' will appear as ASCII text -- the old DOS black background with lighter green letters and numbers of the early computer displays. If you add the second file to describe issues of display and format, your document can be displayed and printed in the attractive, efficient forms we have come to expect -- choices of font, colour, type size, spacing, graphics, tables and, in the case of HTML, links to other files, both within a site and out to other sites on the web. This process can become very complex in its own right, as Arbor Text and Acrobat and even Word itself can demonstrate. But the process is still using just 2 files to achieve all the effects.

Enter the powerful third file! The DTD!
What this third file does is to 'structure' the document by identifying every 'element' (remember this word) of its 'content' components. It identifies the 'title' with the title element, the 'date' with the date element, each paragraph with a paragraph element, your 'name' with a signature element. The (3rd) file which does this is called the Document Type Definition, the DTD. You can write one, you can use other people's, or you can modify existing ones to suit your own and your organization's specific needs. This third file obliges the author of the 'content' file to enter content for every part of the DTD which 'requires' content, and it lets the author add information in any areas which the DTD permits as 'optional': items like the 'title' of the person signing the document, or the email address of the person to whom it is addressed. These optional elements are also designated by the designer of the DTD and can be modified for others' later use.
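To make the idea of required and optional elements concrete, here is a minimal sketch in Python. The DTD and every element name in it (`letter`, `para`, `signer-title` and so on) are assumptions made up for illustration, not a DTD from the course. A DTD would normally live in its own file; it is written here inside the document, as an internal subset, only to keep the example in one piece. In DTD notation `para+` means 'one or more' and `signer-title?` means 'optional'.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD for a letter, embedded as an internal subset.
# Required: title, date, at least one para, signature.
# Optional: signer-title (the "?" marks it as optional).
LETTER_XML = """<?xml version="1.0"?>
<!DOCTYPE letter [
  <!ELEMENT letter (title, date, para+, signature, signer-title?)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT date (#PCDATA)>
  <!ELEMENT para (#PCDATA)>
  <!ELEMENT signature (#PCDATA)>
  <!ELEMENT signer-title (#PCDATA)>
]>
<letter>
  <title>An indignant and imperious command from a frustrated colleague</title>
  <date>Tuesday, October 31st, 2000</date>
  <para>Dear Professor Harrington, please return the electronic documents.</para>
  <signature>Paul Beam</signature>
</letter>"""

# ElementTree reads the document but does NOT validate it against the DTD;
# it is used here only to show that every piece of content sits in a named element.
root = ET.fromstring(LETTER_XML)
element_names = [child.tag for child in root]
print(element_names)
```

Note that the optional `signer-title` element is simply left out of this letter, which the hypothetical DTD allows.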

The DTD is the 3rd file and, every time the document is modified, the change must be made with the DTD 'active', in place behind the content file, showing the author what elements he or she is preparing to add -- like new paragraphs -- or to modify, like adding lists and tables and new sentences inside existing paragraphs or other elements, or changing the words of the 'title'. On finishing, the author passes the modified XML file through the 'parsing' stage of the editor, where all elements are checked to see that all the conditions of the DTD are met and that all content is in a part of the document which allows for that kind of content. The parser would indicate that a 'graphic' should not be in the 'title' element, and the author would have to correct that condition before the file could be saved and displayed.
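The kind of checking the parsing stage does can be sketched roughly as follows. This is not a real validating parser, which would read the rules from the DTD itself; the rules are written out by hand as Python lists, and the names `REQUIRED`, `OPTIONAL`, `TEXT_ONLY` and `check_letter` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written rules standing in for a DTD (an assumption for illustration).
REQUIRED = ["title", "date", "para", "signature"]
OPTIONAL = ["signer-title"]
TEXT_ONLY = ["title", "date"]  # these elements may hold text but no child elements

def check_letter(xml_text):
    """Return a list of problems; an empty list means every check passed."""
    root = ET.fromstring(xml_text)
    problems = []
    present = [child.tag for child in root]
    # Every required element must appear.
    for name in REQUIRED:
        if name not in present:
            problems.append(f"missing required element: {name}")
    # Nothing outside the required and optional sets may appear.
    for name in present:
        if name not in REQUIRED and name not in OPTIONAL:
            problems.append(f"element not allowed here: {name}")
    # Text-only elements must not contain other elements, such as a graphic.
    for child in root:
        if child.tag in TEXT_ONLY:
            for inner in child:
                problems.append(f"{inner.tag} not allowed inside {child.tag}")
    return problems

good = ("<letter><title>Request</title><date>Tuesday, October 31st, 2000</date>"
        "<para>Text.</para><signature>Paul Beam</signature></letter>")
bad = ("<letter><title><graphic/>Request</title>"
       "<para>Text.</para><signature>Paul Beam</signature></letter>")

print(check_letter(good))
print(check_letter(bad))
```

The second letter fails twice: it has no `date`, and it puts a `graphic` inside the `title`. An author would have to fix both before the file passed.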

Consider this example.
Here's an example of the duties of the 3rd file. I'm writing a letter in Word to a colleague to ask for information. In the Word version much of what I write makes sense because you, as the reader, know how to attribute meanings to the various dates to which I refer. But here's the point of my sermon and the value of XML. If we take this file and ask a computer program to identify the 'date' of my sending it to someone -- a potentially crucial issue in a legal argument over ownership of an idea! -- the system cannot distinguish the date of sending from the other dates to which I refer in the 'content' part of the file. That holds in Word, in any word-processing system and in any HTML expression. Watch and learn.

Dear Professor Harrington, please return the electronic documents I sent to your server on December 4th. I find your use of them to be in violation of our agreement of November 17th, in which we agreed specifically that neither of us would make these available until January 13th of 2002. The Department minutes of November 17th make this point specific and your recent actions of last Monday, the 19th, have caused confusion and problems for your students and mine and this difficulty will continue through the end of term on December 18th, at the final examination.

Paul Beam,
Department of English, The University of Waterloo,
Tuesday, October 31st, 2000

Now it's clear to us, the readers -- semantic puzzle-solvers that we are -- that the 'date' of this message, and a very important one for the subsequent process of disciplinary action about to be visited on Professor Harrington (a pseudonym, if you'd not guessed), is . . . wait for it! -- has to be, can only be, must be -- Tuesday, October 31st, 2000! But try to get that information out of a Word or HTML search engine. It can't be done. A search can identify the 'dates' in the document only by listing them all. It found them, but what help is that information to you?

December 4th
November 17th
January 13th of 2002
November 17th
December 18th
Tuesday, October 31st, 2000

Right. Now I can go back and search the document -- its 'content' part -- to determine which of these is the basis for my contemplating legal action, where the argument rests on when I sent the memo in the first place.
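The same dead end shows up in a few lines of Python. This sketch runs a plain pattern search over an abridged copy of the letter. The pattern and the variable names are assumptions for illustration; a real search engine is more elaborate, but it has the same problem.

```python
import re

# Abridged text of the letter above, with the signature block at the end.
LETTER_TEXT = (
    "Dear Professor Harrington, please return the electronic documents "
    "I sent to your server on December 4th. I find your use of them to be "
    "in violation of our agreement of November 17th, in which we agreed "
    "specifically that neither of us would make these available until "
    "January 13th of 2002. The Department minutes of November 17th make "
    "this point specific and this difficulty will continue through the end "
    "of term on December 18th. "
    "Paul Beam, Tuesday, October 31st, 2000"
)

# A month name followed by a day such as "4th" or "31st".
DATE_PATTERN = re.compile(
    r"(?:January|October|November|December) \d{1,2}(?:st|nd|rd|th)"
)

dates = DATE_PATTERN.findall(LETTER_TEXT)
for found in dates:
    print(found)
```

The search returns six dates, and nothing in the result says which one is the date of sending. Guessing 'the last one' only works until somebody puts the date at the top of the letter.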

And now another example.
2nd cut at the example: (into which your clever author has introduced another example which will amaze, astound and clarify!)

An indignant and imperious command from a frustrated colleague

Dear Professor Harrington,
Please return the electronic documents entitled "Students -- can't live with 'em; can't live without 'em" which I sent to your server on December 4th. I find your use of them to promote your allegedly forthcoming book, "The Academic Game: Tilt!" to be in violation of our arrangement of November 17th, in which we agreed specifically that neither of us would make this available until January 13th of 2002 for our joint presentation, "Aloft and aloof -- the Instructor in the New World of Online Learning". The Department minutes of November 17th make this point specific and your recent comments in the Globe and Mail of last Monday, the 19th, have caused confusion and problems for your students and mine and this difficulty will continue through the end of term on December 18th, at the final examination in English 417: The Electronic Document -- Commentary and Dissimulation.

Paul Beam,
Department of English, The University of Waterloo,
Tuesday, October 31st, 2000

Alright! Let's now ask the HTML or Word search engines to identify the 'title' element in this file! The correct answer obviously is "An indignant and imperious command from a frustrated colleague", but how could either file's structure indicate to the search engine that this is true within the welter of other titles?

"Students -- can't live with 'em; can't live without 'em"
"The Academic Game: Tilt!"
"Aloft and aloof -- the Instructor in the New World of Online Learning"
the Globe and Mail
English 417: The Electronic Document -- Commentary and Dissimulation.

This is a wonderful example of what the 3rd file does. In XML the DTD for my letter would require me to place the date in the element 'date'. It would then automatically add the 'markup'... on either side of the information which identifies when my letter was sent. When I or my lawyer or my company or another agency then searched on the time pattern of the messages between me and Professor Harrington, each 'real date' would be identified and the documents could be sequenced and searched for other relevant information about them by the system, without the need for a human to read potentially many, many -- yea, verily, MANY documents in tedious and not very effective ways.
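Once the date sits in its own element, picking it out is a one-line lookup. Here is a minimal sketch under the assumption that the letter's DTD uses elements named `letter`, `title`, `para`, `signature` and `date`; those tag names are hypothetical, and the paragraph text is abridged.

```python
import xml.etree.ElementTree as ET

# The letter marked up with hypothetical element names. The paragraph
# still mentions other dates, but only one date lives in <date>.
LETTER_XML = """<letter>
  <title>An indignant and imperious command from a frustrated colleague</title>
  <para>Please return the documents I sent to your server on December 4th.
  They are in violation of our arrangement of November 17th.</para>
  <signature>Paul Beam</signature>
  <date>Tuesday, October 31st, 2000</date>
</letter>"""

root = ET.fromstring(LETTER_XML)

# Ask for the element by name instead of searching the text for date-like words.
sent_date = root.findtext("date")
print(sent_date)
```

The dates inside the paragraph are ignored, because the lookup is by element name and not by what the text happens to look like.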

Within that search the XML process would correctly locate and identify the 'title' because it would be the only information in the 'title' element. In this manner all parts of the document can be identified and located. They can be easily integrated into appropriate parts of other documents and they can be made as links and contact points from other parts of the document or from other documents. That's one of the places where XML gets complex, and that's why you will learn about the steps to include interactivity, graphics, multimedia and Java in a series of documents -- letters, reports, manuals -- in our courses. The 3rd file, the DTD, 'structures' the content of the XML document so that the different 'elements' of information can be isolated by the computer and utilized for other purposes -- retrieving information reliably, presenting the parts of a document in their logical and semantic sequences, reusing parts of the information in different displays -- a university calendar on the web and in a book, for instance.
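The reuse idea can be sketched in the same way: one structured source, two displays. The `course` markup below is an assumption invented for this example (the course name is borrowed from the letter above), and `web_view` and `print_view` are hypothetical names for a web fragment and a printed-calendar line.

```python
import xml.etree.ElementTree as ET
from html import escape

# One hypothetical calendar entry, stored once.
COURSE_XML = """<course>
  <code>English 417</code>
  <title>The Electronic Document -- Commentary and Dissimulation</title>
</course>"""

course = ET.fromstring(COURSE_XML)
code = course.findtext("code")
title = course.findtext("title")

# Display 1: a fragment for a web page (text escaped for HTML).
web_view = f"<h2>{escape(code)}</h2><p>{escape(title)}</p>"

# Display 2: a single line for a printed calendar.
print_view = f"{code}: {title}"

print(web_view)
print(print_view)
```

Both displays draw on the same two elements, so a correction to the source shows up in each of them.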

In a glimpse, that is what 'markup' of 'content' does and why it is so powerful that people who know how to implement it, to work with the 3rd file, get paid from 30 to 50% more than writers who do not. My lesson can go on (and does) but the next steps are inside the course. Come along and learn about them. You can do it because you now have this view of the three files: 1 -- content (which your writer's background already supports), 2 -- style sheets, which you know a bit about already from HTML and word-processing, and 3 -- the 'manager' file, the DTD, demystified a bit. With that view you can concentrate on how the 3rd file does its business, and that is what XML is all about.

About the author
Dr. Paul Beam is a professor in the English Department at the University of Waterloo, where he instructs and does research in online learning and technical writing. He is working with other principals of Online-Learning.com to provide courses in technical subjects and to develop custom online courses for commercial companies.
© 2000 Online-learning.com