XML
XPLaned:
The mystery of the missing 3rd file
'XML' has a reputation -- sometimes deserved -- for being arcane,
abstruse and almost a programming language in its complexity. It can
be, but that's a bit of bad press. I think it's easy enough to explain
what it is ('more than HTML'), how it goes beyond a 'Word' or
word-processing exercise, and why that difference makes it so
powerful for large organizations and the Web.
As simply as possible, XML (and its 'parent', SGML, before it) uses
a third reference file when any given document is presented for
display in a browser or in print. This is called the (dreaded!)
DTD, the Document Type Definition, and it works in this way.
Word-processing documents consist of two files. The first is the 'content'
file, which holds the information you enter -- the 'text' of the
message. The second file, seldom seen, but
always there, is the internal information which describes how that
information will be displayed. It is often called a 'style sheet'
because it defines what the 'content' will look like. In the absence
of such a file, the 'content' will appear as ASCII text -- the old
DOS black background with lighter green letters and numbers of the
early computer displays. If you add the second file to describe
issues of display and format, your document can be displayed and
printed in the attractive, efficient forms we have come to expect
-- choices of font, colour, type size, spacing, graphics, tables
and, in the case of HTML, of links to other file displays internally
and externally to sites on the web. This process can become very
complex in its own right as Arbor Text and Acrobat and even Word
itself can demonstrate. But the process is still using just 2 files
to achieve all the effects.
What this third file does is to 'structure' the document by identifying
every 'element' (remember this word) of its 'content' components.
It identifies the 'title' with the title element, the 'date' with
the date element, each paragraph with a paragraph element, your
'name' with a signature element. The (3rd) file which does this
is called the Document Type Definition, the DTD. You can write one,
you can use other people's, you can modify existing ones to suit
your and your organization's specific needs. This third file obliges
the author of the 'content' file to enter content for every part
of the DTD file which 'requires' content and lets you add information
in any areas which the DTD permits as 'optional', items like the
'title' of the person signing the document, or the email address
of the person to whom it is addressed. These optional elements are
also designated by the designer of the DTD and can be modified for
others' later use.
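As a rough illustration only, the idea of 'required' and 'optional' elements can be sketched in a few lines of Python. The element names and the two sets below are hypothetical stand-ins for what a real DTD would declare. Python's standard library does not validate against a DTD, so this check merely imitates the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; a real DTD would declare these itself.
REQUIRED = {"title", "date", "para", "signature"}
OPTIONAL = {"signer_title", "email"}

LETTER = """
<letter>
  <title>An indignant and imperious command from a frustrated colleague</title>
  <date>Tuesday, October 31st, 2000</date>
  <para>Please return the electronic documents I sent to your server.</para>
  <signature>Paul Beam</signature>
</letter>
"""

def check_letter(xml_text):
    """Return (missing required elements, elements the rules do not allow)."""
    root = ET.fromstring(xml_text)
    present = {child.tag for child in root}
    missing = sorted(REQUIRED - present)
    unknown = sorted(present - REQUIRED - OPTIONAL)
    return missing, unknown

missing, unknown = check_letter(LETTER)
print("missing:", missing)
print("not allowed:", unknown)
```

A letter that leaves out the 'date' would show up in the first list, while one that adds the optional 'email' would pass without complaint.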
The DTD is the 3rd file and, every time the document is modified,
the editing must be done with the DTD 'active', in place behind the content
file, showing the author what elements he or she is preparing to
add -- like new paragraphs -- or modifying, like adding lists and
tables and new sentences inside existing paragraphs or other elements,
like changing the words of the 'title'. On completing, the author
then passes the modified XML file through the 'parsing' stage of
the editor, where all elements are checked to see that all the conditions
of the DTD are met and that all content is in a part of the document
which allows for that kind of content. The parser would indicate
that a 'graphic' should not be in the 'title' element and the author
would have to modify that condition before the file could be saved
and displayed.
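The 'graphic in the title' complaint can be imitated with a small sketch. It assumes, purely for illustration, a rule that certain elements may hold text only. The TEXT_ONLY set, the element names and the draft letter are all hypothetical, and a real validating parser would take its rules from the DTD rather than from a Python set.

```python
import xml.etree.ElementTree as ET

# Assumed rule for illustration: these elements may hold text only.
TEXT_ONLY = {"title", "date", "signature"}

def find_violations(xml_text):
    """List (parent, child) pairs where a text-only element contains a child element."""
    root = ET.fromstring(xml_text)
    violations = []
    for element in root.iter():
        if element.tag in TEXT_ONLY:
            for child in element:
                violations.append((element.tag, child.tag))
    return violations

# Hypothetical draft in which the author has put a graphic inside the title.
DRAFT = (
    "<letter>"
    "<title>My letter<graphic src='logo.png'/></title>"
    "<date>Tuesday, October 31st, 2000</date>"
    "</letter>"
)

for parent, child in find_violations(DRAFT):
    print(f"'{child}' is not allowed inside '{parent}'")
```

An editor built along these lines would refuse to save the draft until the list of violations came back empty.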
Here's an example of the duties of the 3rd file. I'm writing a letter
in Word to a colleague to ask for information. In the Word version
much of what I do makes sense because you, as the reader, know how
to attribute the meanings of the various dates to which I refer.
But here's the point of my sermon and the value of XML. If we take
this file and ask a computer program to identify the 'date' of my
sending it to someone -- a potentially crucial issue in a legal
argument over ownership of an idea! -- the system cannot distinguish
the date of sending from the other dates to which I refer in the
'content' part of the file, in Word, in any word-processing system
or in any HTML expression. Watch and learn.
Dear Professor Harrington, please return the electronic documents
I sent to your server on December 4th. I find your use of them
to be in violation of our agreement of November 17th, in which
we agreed specifically that neither of us would make these available
until January 13th of 2002. The Department minutes of November
17th make this point specific and your recent actions of last
Monday, the 19th, have caused confusion and problems for your
students and mine and this difficulty will continue through the
end of term on December 18th, at the final examination.
Paul Beam,
Department of English, The University of Waterloo,
Tuesday, October 31st, 2000
Now it's clear to us the readers -- semantic puzzle-solvers that
we are -- that the 'date' of this message, and a very important
one for the subsequent process of disciplinary action about to be
visited on Professor Harrington (a pseudonym, if you'd not guessed)
is . . . wait for it! -- has to be, can only be, must be -- Tuesday,
October 31st, 2000! But try to get that information out of a Word
or HTML search engine. It can't be done. Such a search can identify
the 'dates' in the document only by listing every one of them. It found
them, but how does that help you?
December 4th
November 17th
January 13th of 2002
November 17th
December 18th
Tuesday, October 31st, 2000
Right. Now I can go back and search the document -- its 'content'
part -- to determine which of these is the basis for my contemplating
legal action where the argument rests on when I sent the memo in
the first place.
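The flat search described above can be imitated with a pattern match over the plain text of the letter. The pattern below is my own assumption about what such a search might look for, not a description of how Word or any search engine works. It finds every date but cannot say which one is the date of sending.

```python
import re

LETTER_TEXT = (
    "Dear Professor Harrington, please return the electronic documents "
    "I sent to your server on December 4th. I find your use of them "
    "to be in violation of our agreement of November 17th, in which "
    "we agreed specifically that neither of us would make these available "
    "until January 13th of 2002. The Department minutes of November "
    "17th make this point specific and your recent actions of last "
    "Monday, the 19th, have caused confusion and problems for your "
    "students and mine and this difficulty will continue through the "
    "end of term on December 18th, at the final examination. "
    "Paul Beam, Department of English, The University of Waterloo, "
    "Tuesday, October 31st, 2000"
)

WEEKDAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Optional weekday, then a month, a day with its ordinal suffix,
# and an optional year written as "of 2002" or ", 2000".
DATE_PATTERN = re.compile(
    rf"\b(?:(?:{WEEKDAYS}), )?(?:{MONTHS}) \d{{1,2}}(?:st|nd|rd|th)"
    rf"(?:(?: of|,) \d{{4}})?"
)

def find_dates(text):
    """Return every date-like phrase, in the order it appears."""
    return DATE_PATTERN.findall(text)

for found in find_dates(LETTER_TEXT):
    print(found)
```

Nothing in the result marks the last entry as special. The text itself carries no such information, and that is the gap the 3rd file fills.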
2nd cut at the example: (into which your clever author has introduced
another example which will amaze, astound and clarify!)
An indignant and imperious command from a frustrated colleague
Dear Professor Harrington,
Please return the electronic documents entitled "Students -- can't
live with 'em; can't live without 'em" which I sent to your server
on December 4th. I find your use of them to promote your allegedly
forthcoming book, "The Academic Game: Tilt!" to be in violation
of our arrangement of November 17th, in which we agreed specifically
that neither of us would make this available until January 13th
of 2002 for our joint presentation, "Aloft and aloof -- the
Instructor in the New World of Online Learning". The Department
minutes of November 17th make this point specific and your recent
comments in the Globe and Mail of last Monday, the 19th, have
caused confusion and problems for your students and mine and this
difficulty will continue through the end of term on December 18th,
at the final examination in English 417: The Electronic Document
-- Commentary and Dissimulation.
Paul Beam,
Department of English, The University of Waterloo,
Tuesday, October 31st, 2000
Alright! Let's now ask the HTML or Word search engines to identify
the 'title' element in this file! The correct answer obviously is:
"An indignant and imperious command from a frustrated colleague",
but how could either file's structure indicate to the search engine
that this is true amid the welter of other titles?
"Students -- can't live with 'em; can't live without 'em"
"The Academic Game: Tilt!"
"Aloft and aloof -- the Instructor in the New World of Online Learning"
the Globe and Mail
English 417: The Electronic Document -- Commentary and Dissimulation.
This is a wonderful example of what the 3rd file does. In XML the
DTD for my letter would require me to place the date in the element
'date'. It would then automatically add the 'markup'... on either
side of the information by which I identified my letter as being
sent. When I or my lawyer or my company or another agency then searched
on the time pattern of the messages between me and Professor Harrington,
each 'real date' would be identified and the documents could be
sequenced and searched for other relevant information about them
by the system, without the need for human intervention by reading
potentially many, many -- yeah, verily, MANY documents in tedious
and not very effective ways.
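Here is a minimal sketch of that kind of search, assuming each letter carries its date of sending in a 'date' element. The two sample letters, their wording and the 'iso' attribute (added so that the dates can be sorted) are hypothetical and are not part of any particular DTD.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical letters: 'date' holds only the date of sending, and the
# assumed 'iso' attribute gives it in a machine-readable form.
DOCUMENTS = [
    "<letter><date iso='2000-10-31'>October 31st, 2000</date>"
    "<para>Our agreement of November 17th is quite specific.</para></letter>",
    "<letter><date iso='2000-09-12'>September 12th, 2000</date>"
    "<para>Shall we present together on January 13th?</para></letter>",
]

def sending_date(xml_text):
    """Read the date of sending from the 'date' element, ignoring dates in the text."""
    root = ET.fromstring(xml_text)
    return date.fromisoformat(root.find("date").get("iso"))

# Put the correspondence in order of sending, with no human reading required.
for doc in sorted(DOCUMENTS, key=sending_date):
    print(sending_date(doc))
```

The dates mentioned inside the paragraphs are never consulted, because the program asks only for the 'date' element.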
Within that search the XML process would correctly locate and identify
the 'title' because it would be the only information in the 'title'
element. In this manner all parts of the document can be identified
and located. They can be easily integrated into appropriate parts
of other documents and they can be made as links and contact points
from other parts of the document or from other documents. That's
one of the places where XML gets complex, and that's why you will
learn about the steps to include interactivity, graphics, multimedia
and Java in a series of documents -- letters, reports, manuals in
our courses. The 3rd file, the DTD, 'structures' the content of
the XML document so that the different 'elements' of information
can be isolated by the computer and utilized for other purposes
-- retrieving information reliably, presenting the parts of a document
in their logical and semantic sequences, reusing parts of the information
in different displays -- a university calendar on the web and in
a book, for instance.
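Locating the 'title' and reusing it in two displays might look something like the sketch below. The markup, the helper names and the two output formats are assumptions made for illustration. A real system would normally hand this job to a style sheet.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical markup for the letter; only the element names matter here.
LETTER = (
    "<letter>"
    "<title>An indignant and imperious command from a frustrated colleague</title>"
    "<para>Please return the electronic documents entitled "
    "\"Students -- can't live with 'em; can't live without 'em\".</para>"
    "<date>Tuesday, October 31st, 2000</date>"
    "</letter>"
)

def element_text(xml_text, name):
    """Return the text of the first top-level element called `name`, or None."""
    found = ET.fromstring(xml_text).find(name)
    return None if found is None else found.text

def as_web_heading(xml_text):
    """One display: the title as an HTML heading."""
    return f"<h1>{escape(element_text(xml_text, 'title'))}</h1>"

def as_print_header(xml_text):
    """Another display: title and date on one line for a printed page."""
    return f"{element_text(xml_text, 'title')} ({element_text(xml_text, 'date')})"

print(as_web_heading(LETTER))
print(as_print_header(LETTER))
```

The other titles quoted inside the paragraph are left alone, since only the 'title' element is consulted.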
In a glimpse, that is what 'markup' of 'content' does and why it
is so powerful that people who know how to implement it, to work
with the 3rd file, get paid from 30 to 50% more than writers who
do not. My lesson can go on (and does) but the next steps are inside
the course. Come along and learn about them. You can do it because
you now have a view of the three files: 1 -- content (which your writer's
background already supports), 2 -- style sheets, which you know
a bit about already from HTML and word-processing, and 3 -- the 'manager'
file, the DTD, demystified a bit. You can now concentrate on
how the 3rd file does its business, and that is what -- and all --
XML is about.
Dr. Paul Beam is a professor in the English Department at the University
of Waterloo, where he instructs and does research in online learning
and technical writing. He is working with other principals of Online-Learning.com
to provide courses in technical subjects and to develop custom online
courses for commercial companies.
© 2000 Online-learning.com