What is XML?
by Winchel 'Todd' Vincent III
4/4/2001

XML is a Tool, Not a Religion; It Can Be Used Both Wisely and Unwisely

It helps to explain what XML is not before explaining what XML is. XML is not a religion, although advocates of the technology, including this author, are sometimes considered rather zealous. XML is not a solution to all the world's problems. XML is not a solution to all electronic commerce problems. XML is not a solution to all legal technology problems. Indeed, XML, if not used properly, has the potential to cause a lot of problems rather than solve them.

XML, simply, is a technical tool. It is an information management tool. To be sure, XML is a very cool, powerful tool. But, it is a tool among many other information management tools, including databases and HTML. XML does not replace databases or HTML. It augments them.

An unfortunate side effect of XML "gospel" is the misconception that all products that use XML are good products. In fact, any tool can be used wisely or unwisely, depending on the knowledge and motives of the user. For instance, a hammer can be used to drive nails and construct a house - a social purpose. Likewise, it can be used to hit someone - an antisocial purpose. XML, as an information management tool, is no different.

Indeed, XML can be used for social and antisocial purposes. XML's power can be used to manage and share information. On the other hand, its power can be used to trap information and hinder its usability. The point is, it is a mistake to conclude, just because a label on a product says "XML", that the product is a good product, that it is good for the legal business you conduct, the legal industry generally, or even citizens and society.

Electronic Document Formats

Understanding XML is easier if one first grasps the characteristics and varieties of electronic document formats. A "document format" is a technical "grammar" (or "syntax") that defines an electronic document. There are three types of information that can be captured by electronic document formats and three classes of formats.

The types of information that can be captured in electronic documents are:

  • Formatting: Formatting is how text in a document looks to the reader. For instance, bold, italic, and underline are all types of formatting.
  • Logical Structure: Logical structure is the relationship of grammatical parts of a document to other parts of the document. For instance, a book has chapters; chapters have titles, headings and subheadings; headings and subheadings have paragraphs; paragraphs have sentences. The fact that information in a subheading appears beneath information in a heading tells the reader that the subheading is a logical subset of the heading and not the other way around. This is an important organizational cue that helps a reader better understand the information. People often confuse formatting with logical structure because formatting is often used to highlight logical structure (as has been done in this text, assuming the printer followed the styles of the author). The three most important types of logical structures in documents are (1) tables, (2) outlines (also called lists), and (3) paragraphs. In the law, these structures are often given specific names, such as clauses or sections, but the logical structure is the same.
  • Data: Data are pieces of information, such as "plaintiff," "defendant," "name," and "address." Data are often stuffed into databases so that they have some structure and can be indexed, sorted, and retrieved easily. Data also appear in documents. In legal documents, pieces of data are either randomly sprinkled throughout the document (i.e., "unstructured text") or structured at the top or bottom of a document. For example, a caption on a pleading is an example of "data" that also has "logical structure." Thus, in the U.S. we always know that "Plaintiff" comes before "Defendant" in a caption on a pleading, whereas "Plaintiff" and "Defendant" can also be sprinkled throughout paragraphs, tables, and outlines (logical structure) in a document.

The classes of document formats are:

  • Page Description Formats: Page description formats are formats that strictly capture a document's layout (a combination of formatting and logical structure). Examples are Adobe's Portable Document Format (PDF) and WordPerfect's Envoy format. Images, such as TIFFs, are also page description formats. Page description formats can be either binary or text-based.
  • Mark-up-based Formats: Mark-up-based formats may capture formatting, but are intended to capture data and logical structure. Mark-up-based languages are formats that capture logical structure and data by surrounding text within "tags." Mark-up-based formats usually use "stylesheets" to capture formatting. Hypertext Markup Language ("HTML") (the format used to create web pages) is an example of a mark-up format. However, HTML does not separate formatting from logical structure. It cannot, therefore, capture a rich variety of data. Mark-up formats are usually text-based formats.
  • Compound Document Formats: Compound document formats are formats that capture a mix of formatting and logical structure. Examples of compound document formats are Rich Text Format (RTF), Microsoft Word, and WordPerfect. Compound document formats can be either binary or text-based. For instance, RTF is text-based, while Word and WordPerfect are binary. Compound document formats are different from page description formats because they do not strictly capture layout. For instance, if you import an RTF document created in Microsoft Word into WordPerfect, the formatting will often change. Likewise, the same document opened in different installations of Microsoft Word, even if it is the same version of the program, will sometimes have different formatting (i.e., pages, lines, and fonts sometimes change).

XML is not a document format, although there are document formats created from XML. Rather, XML is a grammar (or syntax) for defining markup-based document formats.

XML Terminology

XML and related standards are used to create "document formats" by combining customized "elements" and, optionally, "stylesheets." XML elements look similar to Hypertext Markup Language ("HTML") elements. For example, the following is an HTML element:

<FONT Size='12'>I agree to give you a peppercorn in exchange for your services.</FONT>

An element is a combination of a "begin tag" and "end tag" and everything in between the two tags. Some elements may only contain text (called "PCDATA," short for parsed character data). Some elements may contain other elements (i.e., tags and text). Elements may also be empty (i.e., contain neither text nor other elements). Elements may have "attributes," which are assigned "values." In the example above, the element name is "FONT." The "FONT" element has an attribute named "Size" with an attribute value of "12." Elements are nested within other elements to create a hierarchy of "marked-up" text. A complete hierarchy of marked-up text is an "XML document." The following is an example of a simple XML document with legal elements.

<Legal>
<Contract>
<Clause>
<Paragraph>
I agree to give you a peppercorn in exchange for your services.
</Paragraph>
</Clause>
</Contract>
</Legal>
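
To see how software reads such a hierarchy, the following is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The helper name "outline" is a hypothetical one chosen for this illustration; it is one of many ways to walk an XML document and is not part of any XML standard.

```python
import xml.etree.ElementTree as ET

# The simple XML document with legal elements from the text above.
LEGAL_XML = """<Legal>
<Contract>
<Clause>
<Paragraph>
I agree to give you a peppercorn in exchange for your services.
</Paragraph>
</Clause>
</Contract>
</Legal>"""


def outline(element, depth=0):
    """Return one indented line per element to show the nesting hierarchy.

    'outline' is a hypothetical helper written for this illustration.
    """
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(LEGAL_XML)
print("\n".join(outline(root)))

# Follow the hierarchy down to the paragraph and read its text.
paragraph = root.find("Contract/Clause/Paragraph")
print(paragraph.text.strip())

# The FONT example has the same anatomy: an element name, an attribute
# name, an attribute value, and text between the begin and end tags.
font = ET.fromstring(
    "<FONT Size='12'>I agree to give you a peppercorn "
    "in exchange for your services.</FONT>"
)
print(font.tag, font.attrib, font.text)
```

The parser reports the same parts the text names: the element name ("tag"), the attributes and their values, and the text between the tags.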

HTML is a standardized set of about 90 pre-defined elements that web designers use to create HTML documents (web pages). The problem with HTML is that it is a dumb "document format." Indeed, a significant disadvantage of HTML is that most of its predefined set of tags do not have a meaningful relationship to the text within them. For example, the following HTML element, with the addition of the "color" attribute, would look colorful in a web browser:

<FONT Color='Red'>I agree to give you a peppercorn in exchange for your services.</FONT>

However, the HTML <FONT> element does not provide meaningful information to a reader, a search engine, or any other information system about the meaning of the text within the element. A web browser knows it should display the text in red, but it knows nothing else about the text. More meaningful mark-up would look like this:

<Contract Color='Red'>I agree to give you a peppercorn in exchange for your services.</Contract>

Unfortunately, custom elements such as <Contract> are not allowed in HTML because they are not defined by the HTML standard. XML, unlike HTML, is not a set of defined elements. Rather, XML is a "grammar" (or "syntax") that can be used to define any number of custom elements. Using XML, the developers of document formats can create industry-specific (e.g., legal-specific) elements, such as <Contract>, <CourtFiling>, or <Transcript>, that can hold information important within their industry.
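
The practical benefit of meaningful element names is that software can select text by what it is rather than by how it looks. The following is a small sketch in Python, assuming a hypothetical document that mixes <Contract> and <Transcript> elements; the sample sentences and the helper name "texts_of" are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element names come from the text above,
# but the contents are made up for this illustration.
FILING = """<Legal>
<Contract>I agree to give you a peppercorn in exchange for your services.</Contract>
<Transcript>The witness was sworn.</Transcript>
<Contract>I agree to lease you the barn for one year.</Contract>
</Legal>"""


def texts_of(root, element_name):
    """Collect the text of every element with the given name."""
    return [element.text for element in root.iter(element_name)]


root = ET.fromstring(FILING)

# Ask for contracts only; the transcript is skipped because of its
# element name, not because of its formatting.
for text in texts_of(root, "Contract"):
    print(text)
```

The same request cannot be made of the <FONT> example, because nothing in that mark-up says which red text is a contract.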

XML Document Type Definition (DTD)

A "document type definition" ("DTD") is a set of rules that define the type, number, and order of elements that may appear in an XML document. A non-technical legal professional does not need to know every rule that governs DTDs. It is useful, however, to understand the basic mechanics so that one can read DTDs and comment on related matters of legal substance. One does not need to remember the details that follow unless there is a desire to build the technology.

The rules of a DTD are set out in "declarations." The following is a set of declarations that define an "Address" document in XML:

<?xml version="1.0" ?>
<!DOCTYPE Address [
<!ELEMENT Address (Street+,City,State,PostalCode,Country) >
<!ELEMENT Street (#PCDATA) >
<!ELEMENT City (#PCDATA) >
<!ELEMENT State (#PCDATA) >
<!ELEMENT PostalCode (#PCDATA) >
<!ELEMENT Country (#PCDATA) >
]>

The first declaration in the DTD is for the "Address" element. According to the declaration, the "Address" element must contain five kinds of elements in the order listed, namely, "Street," "City," "State," "PostalCode," and "Country." Each of these elements may contain "#PCDATA," which is text. The "+" on the end of "Street" means there must be one or more "Street" elements within "Address." The other elements, which carry no such suffix, must each appear exactly once.
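
For readers who want to see how software reads these declarations, the following is a rough sketch in Python using the standard library's xml.parsers.expat module, which reports each element declaration it encounters. The helper name "read_element_declarations" is hypothetical, and a sample "Address" element is appended after the declarations only because the parser expects a complete document.

```python
import xml.parsers.expat as expat

# The declarations from the text, followed by a sample Address element
# (added here so that the parser sees a complete document).
ADDRESS_DOC = """<?xml version="1.0" ?>
<!DOCTYPE Address [
<!ELEMENT Address (Street+,City,State,PostalCode,Country) >
<!ELEMENT Street (#PCDATA) >
<!ELEMENT City (#PCDATA) >
<!ELEMENT State (#PCDATA) >
<!ELEMENT PostalCode (#PCDATA) >
<!ELEMENT Country (#PCDATA) >
]>
<Address>
<Street>2356 Peachtree Street</Street>
<City>Atlanta</City>
<State>Georgia</State>
<PostalCode>30302</PostalCode>
<Country>U.S.A.</Country>
</Address>"""


def read_element_declarations(xml_text):
    """Return {element name: content model} for each <!ELEMENT ...> found."""
    declarations = {}

    def on_element_declaration(name, model):
        # expat passes the model as a tuple: (type, quantifier, name, children)
        declarations[name] = model

    parser = expat.ParserCreate()
    parser.ElementDeclHandler = on_element_declaration
    parser.Parse(xml_text, True)
    return declarations


declarations = read_element_declarations(ADDRESS_DOC)
print(list(declarations))

# Inspect the children listed in the Address declaration.
address_children = declarations["Address"][3]
for _type, quantifier, child_name, _children in address_children:
    repeatable = quantifier == expat.model.XML_CQUANT_PLUS
    print(child_name, "one or more" if repeatable else "exactly one")
```

Only "Street" is reported as repeatable, which matches the "+" in the declaration.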

There are other rules used to define the number of elements that may appear in the document. Additionally, attributes, as shown above in the HTML example, may be specified in declarations corresponding to individual elements. There are numerous other XML rules that are beyond the scope of this paper.

Well-Formed XML

"Well-formed XML" is an XML document that follows the simple rules that (1) every document must have a single root element and (2) for every element there must be a "begin tag" and a corresponding "end tag" that does not overlap with other begin and end tags. Note, elements (begin and end tag combinations) may be nested within other elements, but tags may not overlap. For instance, the following is well-formed XML (that just happens to correspond to the "Address" DTD example above):

<Address>
<Street>2356 Peachtree Street</Street>
<Street>Suite 2000</Street>
<City>Atlanta</City>
<State>Georgia</State>
<PostalCode>30302</PostalCode>
<Country>U.S.A.</Country>
</Address>

This, however, is illegal according to the XML specification:

<Bold>This is some<Italics>bad</Bold> XML</Italics>.
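
A parser makes the distinction concrete. The following is a minimal sketch in Python, using the standard library's xml.etree.ElementTree module, that reports whether a piece of text is well-formed. The function name "is_well_formed" is a hypothetical one chosen for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# The Address example from the text: one root element, tags properly nested.
GOOD = """<Address>
<Street>2356 Peachtree Street</Street>
<Street>Suite 2000</Street>
<City>Atlanta</City>
<State>Georgia</State>
<PostalCode>30302</PostalCode>
<Country>U.S.A.</Country>
</Address>"""

# The overlapping-tag example from the text.
BAD = "<Bold>This is some<Italics>bad</Bold> XML</Italics>."

print(is_well_formed(GOOD))
print(is_well_formed(BAD))

# A second root element also breaks the single-root rule.
print(is_well_formed("<Street>One</Street><Street>Two</Street>"))
```

Note that this check says nothing about whether the elements are the right ones or appear in the right order; it only tests the two structural rules described above.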

It is possible, although not required, to "validate" well-formed XML by using a DTD. Validating well-formed XML with a DTD means that the structure of the well-formed XML is checked by software to see if it matches the rules specified in the declarations of the DTD. If the well-formed XML does not conform to the DTD, then the software (if it conforms to the XML standard) will report an error.
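
To illustrate what such a check involves, the following is a simplified sketch in Python. It is not a real DTD validator: it assumes the "Address" declarations shown earlier and hand-translates the rule (Street+,City,State,PostalCode,Country) into a regular expression over the names of the child elements. The names "ADDRESS_MODEL" and "conforms_to_address_model" are hypothetical, and the sketch ignores details a conforming validator would also check, such as stray text between the child elements.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of (Street+,City,State,PostalCode,Country):
# each child name is followed by one space so names cannot run together.
ADDRESS_MODEL = re.compile(r"(Street )+City State PostalCode Country ")


def conforms_to_address_model(xml_text):
    """Check an Address element against the hand-translated rule above."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    if root.tag != "Address":
        return False
    # Each child is declared as (#PCDATA), so it may not contain elements.
    if any(len(child) > 0 for child in root):
        return False
    child_names = "".join(child.tag + " " for child in root)
    return ADDRESS_MODEL.fullmatch(child_names) is not None


VALID = (
    "<Address><Street>2356 Peachtree Street</Street>"
    "<Street>Suite 2000</Street><City>Atlanta</City>"
    "<State>Georgia</State><PostalCode>30302</PostalCode>"
    "<Country>U.S.A.</Country></Address>"
)

# Well-formed, but the required Street element is missing.
NO_STREET = (
    "<Address><City>Atlanta</City><State>Georgia</State>"
    "<PostalCode>30302</PostalCode><Country>U.S.A.</Country></Address>"
)

# Well-formed, but City and State are in the wrong order.
WRONG_ORDER = (
    "<Address><Street>2356 Peachtree Street</Street>"
    "<State>Georgia</State><City>Atlanta</City>"
    "<PostalCode>30302</PostalCode><Country>U.S.A.</Country></Address>"
)

print(conforms_to_address_model(VALID))
print(conforms_to_address_model(NO_STREET))
print(conforms_to_address_model(WRONG_ORDER))
```

All three samples are well-formed XML; only the first also satisfies the rules of the DTD, which is the difference between being well-formed and being valid.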


 
Copyright © 1997 - 2005 Winchel "Todd" Vincent III All Rights Reserved.