XML, as you have probably gathered by now, is a generic markup language used to structure data in a way that is easy for humans to read and easy for computers to parse. The average web developer doesn’t usually deal with XML. But depending on your proclivities, you may find it interesting. If you do, here is a simple starting point for how to think about it.
XHTML, RSS, and OPML are all languages built on XML, the general language that represents data as a series of nodes. I will use the terms “tag”, “node”, and “element” synonymously, although the XML specification draws minor distinctions between the three that are mostly irrelevant to us at the moment. For now, think of a tag, a node, and an element as the same thing.
You are already familiar with the general way that XML tags are written, given that you know XHTML. The difference is that XHTML has a prescribed set of tags that you can use. XML stands for Extensible Markup Language, which means that in XML you can use whatever tag names you want… it’s up to you to decide what tags to use, and what they mean.
XML syntax rules
There are only a few rules for how to structure XML documents that you should be aware of if you ever work with XML. There is plenty of documentation available online about the syntax of XML, so here is a quick overview of the most important parts:
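- The first line in the document should be the XML declaration (often loosely called the xml tag): <?xml version="1.0" encoding="UTF-8"?>. The declaration is optional in XML 1.0, but when it is present it must come before anything else in the document.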
- There must be one root node that contains all other nodes. In the case of XHTML, this will be the <html> node, for RSS it is the <rss> node. For your own custom XML formats, you can call it whatever you want, so long as it surrounds every other tag in the document.
- XML is case-sensitive. So as a general convention, to keep things simple, I recommend you always use lowercase letters.
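One way to see these rules in action is to hand a few small documents to a parser and watch which ones it accepts. Here is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the three sample documents are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One root element surrounding everything else: accepted.
one_root = "<class><title>Web Development Intensive</title></class>"

# Two top-level elements and no single root: rejected.
two_roots = "<title>Web Development Intensive</title><instructor>Bob Shakey</instructor>"

# XML is case-sensitive, so <Title> is not closed by </title>: rejected.
mixed_case = "<class><Title>Web Development Intensive</title></class>"

results = {
    "one_root": is_well_formed(one_root),
    "two_roots": is_well_formed(two_roots),
    "mixed_case": is_well_formed(mixed_case),
}
print(results)
```

Only the first document passes. The other two break the single-root rule and the case-sensitivity rule respectively.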
An example of an XML document
Here is a simple example of an XML document that represents the list of students in a class:
<?xml version="1.0" encoding="UTF-8"?>
<class>
  <title>Web Development Intensive</title>
  <description>A course about making websites</description>
  <instructor>Bob Shakey</instructor>
  <students>
    <student>
      <first_name>John</first_name>
      <last_name>Smith</last_name>
    </student>
    <student>
      <first_name>Mary</first_name>
      <last_name>Wahloo</last_name>
    </student>
    <student>
      <first_name>Dmitry</first_name>
      <last_name>Johnson</last_name>
    </student>
  </students>
</class>
You can see a few things about how I have decided to structure this example:
- it is written in plain text – all XML documents are just plain text documents
- the first line is the XML declaration (the xml tag), which indicates what version of XML we are using – the declaration is optional in XML 1.0, but when it is present it must come first
- all tags are lowercase – this is my choice
- there is a root element that contains all other elements – all XML documents must have a root element
- the root element is called <class> – this name and all the other names of elements are words I made up
- elements are nested, one inside of another, and this nesting is not arbitrary
- the elements nested inside of the <class> element hold information about that class
- and the element <students> surrounds a list of <student> elements that contain information about each student
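To show why the nesting matters, here is a rough sketch of how a program might read this kind of document, again using Python's standard xml.etree.ElementTree module. The CLASS_XML string is a shortened copy of the example above, and the variable names are my own.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the class list example (two students instead of three).
CLASS_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<class>
  <title>Web Development Intensive</title>
  <instructor>Bob Shakey</instructor>
  <students>
    <student>
      <first_name>John</first_name>
      <last_name>Smith</last_name>
    </student>
    <student>
      <first_name>Mary</first_name>
      <last_name>Wahloo</last_name>
    </student>
  </students>
</class>
"""

root = ET.fromstring(CLASS_XML)

# The root element is <class>; its direct children describe the class.
title = root.findtext("title")

# Each <student> sits inside <students>, so the search path mirrors the nesting.
names = [
    f"{s.findtext('first_name')} {s.findtext('last_name')}"
    for s in root.findall("students/student")
]

print(root.tag, "-", title)
print(names)
```

The path "students/student" works only because every <student> is nested inside <students>, which is the point of choosing the nesting deliberately.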
XML namespaces
It is possible, with XML, to use namespaces to indicate more information about what set of rules any particular tag should follow. Looking at an example will make this clearer.
Take the hypothetical scenario where there are two different kinds of students who can take classes: students who come from Pratt, and students who come from NYU. And the data about each student comes from two different places: NYU student data comes from an NYU database, and Pratt student data comes from a Pratt database.
You have seen in my class list example above that I am using a tag called <student> that contains info about a student. With namespaces, it’s possible for me to have two versions of the <student> element: one for Pratt students, and another for NYU students.
<nyu:student>
  <nyu:first_name>John</nyu:first_name>
  <nyu:last_name>Smith</nyu:last_name>
</nyu:student>
<pratt:student>
  <pratt:first_name>Mary</pratt:first_name>
  <pratt:last_name>Wahloo</pratt:last_name>
</pratt:student>
<nyu:student>
  <nyu:first_name>Dmitry</nyu:first_name>
  <nyu:last_name>Johnson</nyu:last_name>
</nyu:student>
An XML parser, which understands the rules of XML and knows how to properly read data represented in XML format, will see these two types of “student” tags as two totally separate tag names with no relation to each other.
In fact, each namespace, “nyu” and “pratt”, is identified by a URL, and the tag names and nesting structures allowed for tags within that namespace are usually described in a definition file. When using a namespace, you must declare it in the XML document with a URL, which by convention points to a namespace definition document where these namespace rules are defined.
So, our updated example with namespaces would look like this:
<?xml version="1.0" encoding="UTF-8"?>
<class
  xmlns:nyu="http://nyu.edu/some_imaginary_namespace_definition_file.xml"
  xmlns:pratt="http://pratt.edu/some_imaginary_namespace_definition_file.xml">
  <title>Web Development Intensive</title>
  <description>A course about making websites</description>
  <instructor>Bob Shakey</instructor>
  <students>
    <nyu:student>
      <nyu:first_name>John</nyu:first_name>
      <nyu:last_name>Smith</nyu:last_name>
    </nyu:student>
    <pratt:student>
      <pratt:first_name>Mary</pratt:first_name>
      <pratt:last_name>Wahloo</pratt:last_name>
    </pratt:student>
    <nyu:student>
      <nyu:first_name>Dmitry</nyu:first_name>
      <nyu:last_name>Johnson</nyu:last_name>
    </nyu:student>
  </students>
</class>
Note the two new attributes of the <class> element:
- xmlns:nyu declares the “nyu” prefix and indicates a URL where the “nyu” namespace is defined
- xmlns:pratt declares the “pratt” prefix and indicates a URL where the “pratt” namespace is defined
In practice, it is not uncommon for these to be fake URLs that don’t actually point to any definition document. The XML parser treats each URL as nothing more than a unique name for the namespace, and will never actually check to make sure that the URLs really contain definition files. These URLs are there more for the humans who happen to read the code.
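You can check both claims yourself with a small experiment. The sketch below uses Python's standard xml.etree.ElementTree module with the same imaginary URLs as the example above. The document is a cut-down version of the class list, and the variable names are mine.

```python
import xml.etree.ElementTree as ET

# Both URLs are imaginary; the parser treats them as plain identifier strings.
NYU = "http://nyu.edu/some_imaginary_namespace_definition_file.xml"
PRATT = "http://pratt.edu/some_imaginary_namespace_definition_file.xml"

doc = f"""<students xmlns:nyu="{NYU}" xmlns:pratt="{PRATT}">
  <nyu:student><nyu:first_name>John</nyu:first_name></nyu:student>
  <pratt:student><pratt:first_name>Mary</pratt:first_name></pratt:student>
</students>"""

# Parsing succeeds even though neither URL points to a real file.
root = ET.fromstring(doc)

# ElementTree expands each prefix to "{namespace-url}local-name".
tags = [child.tag for child in root]
print(tags)

# Searching for one kind of student does not match the other.
prefixes = {"nyu": NYU, "pratt": PRATT}
nyu_students = root.findall("nyu:student", prefixes)
pratt_students = root.findall("pratt:student", prefixes)
print(len(nyu_students), len(pratt_students))
```

Notice that the parser replaces each prefix with the full URL in the tag name. That is how it keeps the two kinds of “student” apart, and it is also why the URL only has to be unique, not real.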
So for most small to medium-sized projects, you are usually free to use XML as you wish, without worrying about definition files, so long as you declare the namespaces you are using with fake URLs.
On large-scale projects, you will probably want to actually create real definition files to make sure you are complying with whatever namespace specifications you have decided upon for your set of possible XML elements and nesting structures. On projects of that size, stricter coding standards usually help keep things from getting too inconsistent and difficult to manage.
Taking a look at RSS as XML
As we know, RSS is built on XML, meaning it follows all the rules of XML. There are actually several different specifications which people generally refer to as simply “RSS”: RSS 1.0 (which is based on RDF), RSS 2.0, Atom, and others. Each specification contains a list of tag names, rules for nesting those tags, and what those tags are supposed to mean. In other words, they are more-or-less namespace specifications.
For almost all intents and purposes of the average developer, all the competing RSS specifications are equivalent, and don’t listen to anyone who tells you otherwise. Unless you have a very specific reason for picking any particular one, you can just choose one type arbitrarily. Any decent RSS reader, like Google Reader, will be able to deal equally well with any of these types.
Let’s take a look at the RSS that the class blog publishes. To see this RSS feed live on the site, go to the class blog in Firefox at http://wd.onepotcooking.com, click the RSS icon in the address bar of the browser, and then click “View Source”.
You should see RSS code like this, but with more than one “item” element:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:wfw="http://wellformedweb.org/CommentAPI/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
>
  <channel>
    <title>Web Development Intensive</title>
    <atom:link href="http://wd.onepotcooking.com/?feed=rss2" rel="self" type="application/rss+xml" />
    <link>http://wd.onepotcooking.com</link>
    <description>NYU SCPS</description>
    <pubDate>Mon, 27 Apr 2009 01:57:40 +0000</pubDate>
    <generator>http://wordpress.org/?v=2.7</generator>
    <language>en</language>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <item>
      <title>Class 10 - An MVC Social Network Example</title>
      <link>http://wd.onepotcooking.com/?p=480</link>
      <comments>http://wd.onepotcooking.com/?p=480#comments</comments>
      <pubDate>Mon, 27 Apr 2009 01:57:26 +0000</pubDate>
      <dc:creator>amos</dc:creator>
      <category><![CDATA[mysql]]></category>
      <category><![CDATA[php]]></category>
      <category><![CDATA[xhtml]]></category>
      <guid isPermaLink="false">http://wd.onepotcooking.com/?p=480</guid>
      <description>
        <![CDATA[ this is where a short description of the blog post goes ]]>
      </description>
      <content:encoded>
        <![CDATA[ this is where the full content of the blog post goes ]]>
      </content:encoded>
      <wfw:commentRss>http://wd.onepotcooking.com/?feed=rss2&amp;p=480</wfw:commentRss>
    </item>
  </channel>
</rss>
A few initial observations:
- The document begins with an “xml” tag.
- You can see that the <rss> tag is the root node of the XML document.
- The <rss> tag defines a bunch of namespaces for tag names that are specified in the “content”, “wfw”, “dc”, “atom”, and “sy” namespaces, rather than in the RSS 2.0 specification itself that is the default specification used for this page. Try going to those URLs directly in the browser to see what these namespace specifications look like.
- The <channel> element has basic info about this site
- An <item> element is used to hold the contents of each blog post on the site.
- Some tags have a namespace prefix, and some don’t. Those without a namespace prefix are part of the RSS 2.0 specification, which is the default for this document. Those with a namespace prefix are defined in another specification with its own set of tag names. So it’s clearly possible to use tags from a variety of specifications, so long as the tags themselves indicate the specification in which they are defined by using a namespace prefix.
- You see that the contents of some elements are wrapped in <![CDATA[...]]> sections. A CDATA section tells any XML parser reading this document (in this case it would be an RSS reader) to treat everything inside it as plain text rather than as markup, so special characters like < and & do not need to be escaped. For example, the contents of the <content:encoded> element are the actual contents of the blog post I wrote. I may have made some mistakes in my example code, and my blog posts may therefore contain badly formatted XML or XHTML. Putting that data inside a CDATA section tells the RSS reader, which would normally throw an error when it encounters badly formatted XML, that it should not try to evaluate whatever is inside as XML. So it just treats it as plain text.
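The effect of CDATA is easy to demonstrate. In this sketch, which assumes Python's standard xml.etree.ElementTree module as the parser, the same sloppy content is parsed once inside a CDATA section and once without it. The sample strings are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# The content is deliberately sloppy: an unclosed <p> and a bare ampersand,
# both of which are errors in ordinary XML.
with_cdata = "<description><![CDATA[<p>Cats & dogs]]></description>"
without_cdata = "<description><p>Cats & dogs</description>"

element = ET.fromstring(with_cdata)

# The parser hands the CDATA contents back as plain text, not as child elements.
text = element.text
child_count = len(element)
print(repr(text), child_count)

# Without the CDATA wrapper, the same content makes the parser give up.
try:
    ET.fromstring(without_cdata)
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)
```

With the CDATA wrapper the <p> comes back as four ordinary characters of text. Without it, the parser tries to read <p> as a real element and stops with an error.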
I recommend, if you’re interested in this, that you go to multiple news sites and blogs, and click the RSS icon in the address bar, and view the source code of their RSS feeds. You will get a good idea of how commercial sites are structuring their RSS code. Given that RSS is still a new technology that doesn’t always adhere perfectly to the standards defined in the various specifications, the best guide you can have for how you should structure your RSS code is to look at other feeds from the major blog and news companies and follow their example.
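If you want to compare feeds more systematically than by reading their source, a few lines of code can pull out the parts you care about. The following is only a sketch, assuming Python's standard xml.etree.ElementTree module. The FEED string is a trimmed-down copy of the class blog feed shown earlier rather than a live download, and the names FEED, NS, and posts are my own.

```python
import xml.etree.ElementTree as ET

# A trimmed-down copy of the class blog feed, with a single item.
FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Web Development Intensive</title>
    <item>
      <title>Class 10 - An MVC Social Network Example</title>
      <link>http://wd.onepotcooking.com/?p=480</link>
      <dc:creator>amos</dc:creator>
      <category><![CDATA[mysql]]></category>
      <category><![CDATA[php]]></category>
    </item>
  </channel>
</rss>
"""

# Map the prefix used in searches to the namespace URL declared in the feed.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

root = ET.fromstring(FEED)
channel = root.find("channel")

posts = []
for item in channel.findall("item"):
    posts.append({
        # Unprefixed tags come from RSS 2.0 itself.
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        # Prefixed tags can only be found with the namespace mapping.
        "creator": item.findtext("dc:creator", namespaces=NS),
        "categories": [c.text for c in item.findall("category")],
    })

print(channel.findtext("title"))
print(posts)
```

Running the same loop over feeds saved from a few different sites is a quick way to see which tags and namespaces they have in common.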