Over the past few weeks we have discussed how the modern dynamic web, and the digital humanities projects it hosts, is built on structured data (usually residing in a relational database) that is served in response to a user request and rendered in the browser as HTML markup. This week we are exploring how these two elements (structured data and markup) come together in a mainstay of DH methods: encoding texts using XML and the TEI.
XML (eXtensible Markup Language) is a sibling of HTML, but whereas the latter can include formatting instructions telling the browser how to display information, the former is purely descriptive. XML doesn’t do anything; it just describes the data in a regularized, structured way that allows for the easy storage and interchange of information between different applications. Deciding how to describe the contents of a text involves interpretive choices that can pose challenges to humanities scholarship, which we’ll discuss more in the next class on the Text Encoding Initiative (TEI). For now, we’re going to explore the basics of XML and see how we can store, access, and manipulate data.
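To make the "purely descriptive" point concrete, here is a minimal sketch in Python, a language outside this week's exercise, used only because its standard library can read XML without a server. The `<book>` record and its element names are invented for illustration. Nothing in the XML says how the data should be displayed; the program decides what to do with each labeled piece.

```python
import xml.etree.ElementTree as ET

# A hypothetical record: the tag and attribute names are made up for this example.
record = """<book id="b1">
  <title>Frankenstein</title>
  <author>Mary Shelley</author>
  <year>1818</year>
</book>"""

# Parse the plain-text markup into a tree of elements.
book = ET.fromstring(record)

# The XML only labels the data; our code chooses how to use it.
title = book.findtext("title")
author = book.findtext("author")
year = int(book.findtext("year"))   # element text is a string until we convert it
book_id = book.get("id")            # attributes are read separately from child elements

summary = f"{title} by {author} ({year}) [{book_id}]"
print(summary)
```

Any other application that understands the same element names could read this record and do something entirely different with it, which is what makes XML useful for interchange.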
We went over the main parameters in class, but an excellent primer to XML in the context of DH has been put together by Frédéric Kaplan for his DH101 course and can be viewed in the slideshow below.
Exercise
Since XML is like a database in plain text, we can store it anywhere and use pretty much any programming language to perform operations on it, transform it, and dictate how it should be displayed. For this exercise we will extract the data from a simple RSS feed (a ubiquitous form of XML document) and output it as HTML using three different languages: JavaScript, PHP and XSLT. Two of these we’ve already encountered, but more information on all three can be found in the resources section below.
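As a rough illustration of the "any programming language" claim, the same extract-and-output task might look like this in a fourth language, Python. This is a sketch, not part of the exercise files: `SAMPLE_FEED` is an assumed stand-in that follows the usual RSS 2.0 layout (`<channel>` containing `<item>` elements with `<title>`, `<link>` and `<description>`), and the function name `feed_to_html` is hypothetical. The actual feed.xml may be structured differently.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed stand-in for feed.xml: a minimal RSS 2.0 document with invented content.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <description>A hypothetical feed</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Hello &amp; welcome</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>More news</description>
    </item>
  </channel>
</rss>"""


def feed_to_html(xml_text):
    """Return an HTML fragment listing each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)      # root is the <rss> element
    channel = root.find("channel")
    heading = escape(channel.findtext("title", default=""))
    lines = [f"<h1>{heading}</h1>", "<ul>"]
    for item in channel.findall("item"):
        # escape() re-encodes characters such as & so the output is valid HTML
        title = escape(item.findtext("title", default=""))
        description = escape(item.findtext("description", default=""))
        lines.append(f"  <li><strong>{title}</strong>: {description}</li>")
    lines.append("</ul>")
    return "\n".join(lines)


html_output = feed_to_html(SAMPLE_FEED)
print(html_output)
```

The shape is the same in every language: parse the text into a tree, select the nodes you want, and wrap their contents in HTML.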
- Download the zipped exercise files here.
- To run the scripts, you’ll need to put them in the web root of a server environment such as your localhost; we are using MAMP.
- Copy xml_examples.zip to your MAMP web root (Applications > MAMP > htdocs) and unzip it.
- You should have the following files:
  - feed.xml (the data)
  - ParseXML.html
  - ParseXML.php
  - xls_examples
    - ParseXML.xsl
    - feed_xsl.xml
- Examine the various files in both a text editor and browser and try to figure out what they do. How are they similar and how do they differ? Are some languages easier for you to understand? Look at how they navigate the hierarchical node tree and compare the XML DOM to the HTML one. Finally, start hacking on them to see if you can change the elements being selected or alter how they are being output to HTML; e.g. can you swap out the <description> for the <link> text?
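If it helps to see the node-tree navigation outside the browser, here is a small sketch using Python's `xml.dom.minidom`, which exposes DOM-style methods with the same names as the ones used in JavaScript (`getElementsByTagName`, `firstChild`, `nodeValue`). The feed content is invented, and `child_text` and `list_items` are hypothetical helper names. The `field` parameter shows one way to swap `<description>` for `<link>`.

```python
from xml.dom import minidom

# Invented feed content that follows the usual RSS 2.0 item structure.
FEED = """<rss version="2.0"><channel><title>Example Feed</title>
<item><title>First post</title><link>http://example.com/first</link><description>Hello</description></item>
<item><title>Second post</title><link>http://example.com/second</link><description>More news</description></item>
</channel></rss>"""


def child_text(parent, tag):
    """Text of the first <tag> element under parent, or '' if it is missing or empty."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return ""
    # The element's first child is a text node; nodeValue holds its string.
    return nodes[0].firstChild.nodeValue


def list_items(xml_text, field="description"):
    """Return (title, field) pairs for every <item>, walking the DOM tree."""
    doc = minidom.parseString(xml_text)
    rows = []
    for item in doc.getElementsByTagName("item"):
        rows.append((child_text(item, "title"), child_text(item, field)))
    return rows


print(list_items(FEED))            # pairs each title with its <description>
print(list_items(FEED, "link"))    # same walk, selecting <link> instead
```

Changing one argument changes which child node is selected; the path through the tree (document, then item, then child element, then text node) stays the same.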
Resources
XML DOM and JavaScript
w3schools is always a good place to start for the by-the-book definition of a language or standard with some good interactive examples, and their XML DOM tutorial is no exception.
SimpleXML in PHP 5
If your project allows server-side scripting, it is MUCH easier to use PHP to parse XML than JavaScript. The w3schools introduction to SimpleXML in PHP 5 is solid, but TeamTreehouse.com has a more readable and accessible real-world example of how to parse XML with PHP’s SimpleXML functions.
XSLT
XSLT (eXtensible Stylesheet Language Transformations) is to XML what CSS is to HTML, but it’s also a lot more. More like a programming language within markup tags than a regular markup language, it’s a strange but powerful hybrid that will let you do the same things as the languages above: transform XML data into other XML documents or HTML. Learn the basics at w3schools’ XSLT tutorial. If you’d like a more in-depth explanation of how XML and XSLT work together, check out this tutorial at xmlmaster.org, which is geared toward the retail/business world but contains basic information relevant for DH applications.