1.5 What is XML and why do we prefer to use JSON?

XML is a language. Anyway, that's what its name (Extensible Markup Language) claims. Note – it isn't a programming language, and although it is possible to build a real programming language on top of XML, that has never been its native niche. XML is – like JSON – a universal and transparent carrier of any type of data. You can use it to store and transfer documents of virtually any type. For example, the newer document formats produced by MS Office applications (the ones with file extensions ending with x, like docx) utilize XML to store such different data as spreadsheets, presentations, or texts.

As you probably suspect, XML is much older than JSON. Moreover, it's heavier and less flexible. We can even say that XML seems to be bloated compared to JSON.

We’re not going to teach you how to use XML in Python. We only want to show you how it’s built and what the most important differences between XML and JSON are. It can be intriguing how different they are, although both solutions were invented for nearly the same purpose.

Take a look – it's a simple sample XML document:

<?xml version = "1.0" encoding = "utf-8"?>
<!-- cars.xml - List of cars ready to sell -->
<!DOCTYPE cars_for_sale SYSTEM "cars.dtd">
<cars_for_sale>
   <car>
      <id>1</id>
      <brand>Ford</brand>
      <model>Mustang</model>
      <production_year>1972</production_year>
      <price currency="USD">35900</price>
   </car>
   <car>
      <id>2</id>
      <brand>Aston Martin</brand>
      <model>Rapide</model>
      <production_year>2010</production_year>
      <price currency="GBP">32000</price>
   </car>
</cars_for_sale>

The document contains a part of an offer published by a very exclusive second-hand car store. It's not a regular store – it's a store with the most legendary cars of all time. Don't expect to be able to buy “just a car” here. They sell special cars for special owners at special prices adjusted to the prestige these cars bring. Moreover, the store has some branches around the world, so the car prices may be expressed in many different currencies.

How does XML cope with such a problem?

Let us lead you through the document and show you some essential aspects.

Let's start with the first line – it plays a very important role:

<?xml version = "1.0" encoding = "utf-8"?>

First of all, it declares that the document contains XML text. Note the very original “parentheses” in which the line is enclosed: <? and ?> .

As you can see, XML uses plain text, and that's what makes it similar to JSON, but we should note that the similarities end at this point.

Note the two phrases built according to the following pattern:

attribute = value

The first (header) line contains two attribute names: version and encoding. We don’t think you’ll have any problem identifying their meaning: the first informs you which version of the XML has been used to encode the document (in fact, there are two versions currently in use: 1.0 and 1.1) while the second says how the document's text is encoded (as you may expect, UTF-8 is a natural choice here).

Thanks to this line, a program responsible for parsing a file’s content is calm about the future. It knows what to expect next.
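To see the encoding attribute at work, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module (the one shown later in this section). The one-element document and the brand it contains are made up for this illustration only.

```python
import xml.etree.ElementTree as ET

# A made-up document stored as ISO-8859-1 bytes;
# in that encoding the letter "ë" occupies a single byte.
raw = (b'<?xml version="1.0" encoding="iso-8859-1"?>'
       b'<brand>' + 'Citroën'.encode('iso-8859-1') + b'</brand>')

# The parser reads the header line and decodes the bytes accordingly.
brand = ET.fromstring(raw)
print(brand.text)
```

Without a correct encoding declaration, the parser would have no reliable way to turn these bytes back into the right characters.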

The second line isn't very important – do you know why?

<!-- cars.xml - List of cars ready to sell -->

It's a comment. It means nothing. The XML parser will ignore it completely.

How do we recognize a comment inside the XML document? It's easy: a comment starts with:

 <!--

and ends with:

-->

Simple, isn't it?

Of course, a comment, like any other part of an XML document, can be spread over more than one line. XML is flexible with that.
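A short sketch can confirm that comments are ignored. It assumes Python's standard xml.etree.ElementTree module with its default parser settings; the shortened document is made up for the example.

```python
import xml.etree.ElementTree as ET

doc = """<cars_for_sale>
   <!-- a comment that
        spans two lines -->
   <car/>
</cars_for_sale>"""

root = ET.fromstring(doc)

# Only the <car> element shows up as a child; the comment is skipped.
children = [child.tag for child in root]
print(children)
```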

The third line is very curious, as it isn't actually XML:

<!DOCTYPE cars_for_sale SYSTEM "cars.dtd">

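In general, XML documents may be built in two ways: freely, with the names of the elements and attributes chosen by the author and not described anywhere else, or strictly, with the names, the nesting, and the attributes taken from an external meta-document (a document describing another document).

The meta-document is a document type definition (hence the dtd suffix visible in the third line).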

If you want to aggregate your document with its external definition, you should put the DOCTYPE line inside your XML document. The definition may be located anywhere: it can be placed beforehand at the target parsing server, or it may be put in any Internet location (in this case the DOCTYPE line contains the full URL/URI of the DTD).

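To make a long story short, the DOCTYPE line contains: the name of the document's root element (cars_for_sale), the SYSTEM keyword, which announces that the location of the definition follows, and the location itself (cars.dtd).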

Note: DTDs use a specialized syntax of their own, inherited from SGML (the ancestor of XML), in order to describe XML document content. We won't deal with it here. It's a separate and very complicated story – sorry.
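One practical remark, offered as a hedged sketch rather than a rule: the parser behind Python's standard xml.etree.ElementTree module is a non-validating one, so it accepts the DOCTYPE line but does not check the document against the DTD. In the example below the cars.dtd file does not need to exist at all.

```python
import xml.etree.ElementTree as ET

# cars.dtd is not available here, and the parser does not ask for it.
doc = """<!DOCTYPE cars_for_sale SYSTEM "cars.dtd">
<cars_for_sale>
   <car><brand>Ford</brand></car>
</cars_for_sale>"""

root = ET.fromstring(doc)
print(root.tag, len(root))
```

If you need real validation against a DTD, you have to look for a tool that supports it explicitly.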

It's a bit ironic that we had to move through three complete lines to finally reach something that is actually XML:

<cars_for_sale>
  ...
</cars_for_sale>

The XML document consists of elements. Each element is marked by a pair of tags: an opening tag and a closing tag. Both tags look nearly identical, but the closing tag's name starts with /. Tags can be easily identified as they are enclosed inside < and >.

This means that the element named CuriousTag starts where the following tag is placed:

<CuriousTag>

and ends in the same location where the closing tag exists:

</CuriousTag>

All the text within these tags is the element's content (or simply the element’s text).

Note: there are also empty elements, which can be specified in a more compact way. For example, if you need to specify an element like this:

<empty></empty>

(note: there is no content between the tags!) you may want to use the shorter form:

<empty/>

Be aware that an empty element may not be the same as an absent element – the omission of a particular element may be treated as an error by the parser.
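The difference between an empty element and an absent one can be sketched with Python's standard xml.etree.ElementTree module. The notes element is a hypothetical one, invented only for this example.

```python
import xml.etree.ElementTree as ET

car = ET.fromstring('<car><model></model><notes/></car>')

# Both spellings of an empty element are parsed the same way:
# the element exists, but it has no text.
print(car.find('model').text)   # None
print(car.find('notes').text)   # None

# An absent element is a different thing: find() itself returns None.
print(car.find('price'))        # None

# When written back, ElementTree uses the short form for empty elements.
serialized = ET.tostring(car, encoding='unicode')
print(serialized)
```

Code that reads an optional element should therefore check the result of find() before touching its text.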

Look at the example – the document contains one top-level element (not contained by any other element) named cars_for_sale. This is the root element, which must occur exactly once inside the XML document.
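The "exactly one root" rule is easy to check. The sketch below, which assumes Python's standard xml.etree.ElementTree module, feeds the parser a fragment with two top-level elements and catches the resulting error.

```python
import xml.etree.ElementTree as ET

# Two top-level elements and no common root around them.
two_roots = '<car>Ford</car><car>Aston Martin</car>'

try:
    ET.fromstring(two_roots)
    parsed = True
except ET.ParseError as error:
    parsed = False
    print('not well-formed:', error)
```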

Let's dive into it.

Try to answer the following question: how many cars can you buy here?

   <car>
	...
   </car>
   <car>
	...
   </car>

Yes, two. The answer is obvious as there are two elements named car inside the root element. Each of them starts with the tag <car> and ends with the tag </car>.

Simple? Yes, indisputably.

Let's get into the car and look around.

What can we learn about each of the cars?

   <car>
      <id>1</id>
      <brand>Ford</brand>
      <model>Mustang</model>
      <production_year>1972</production_year>
      <price currency="USD">35900</price>
   </car>

As you can see, the car is described by its identifier (id), brand, model, production year (production_year), and price.

Note: there is something new and very intriguing inside the opening <price> tag – can you see it?

<price currency="USD">35900</price>

This is the attribute.

What is this needed for?

Some values are absolute or almost absolute. A year number is a good example: most customers from all over the world, when told that the car was built in 1972, will know exactly how old the vehicle is.

But if we tell them that it cost 35,900, they'll immediately ask “in which currency?”.

This is why we need attributes. Of course, we could invent some specialized tags named, e.g., <price_in_usd> or <price_in_gbp>, but that would be neither easy to parse nor elegant. Attributes work much better here.

As you can see, the <price> tag uses one attribute named currency. Its value can be taken from the international standard ISO 4217, which makes the record unequivocal.

We should also add that XML allows us to put as many attributes inside a tag as we need.
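Here is a minimal sketch of reading attributes with Python's standard xml.etree.ElementTree module. The negotiable attribute is a hypothetical second attribute, added only to show that a tag may carry more than one.

```python
import xml.etree.ElementTree as ET

# "negotiable" is invented for this example; "currency" comes from cars.xml
price = ET.fromstring('<price currency="USD" negotiable="yes">35900</price>')

print(price.text)               # the element's text
print(price.get('currency'))    # the value of one attribute
print(price.attrib)             # all the attributes as a dictionary
print(price.get('tax', 'n/a'))  # a missing attribute falls back to a default
```

Note that both the text and the attribute values arrive as strings; converting 35900 to a number is up to your code.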

We hope that XML is clear now, but there is one question that has to be put here: how do we process XML documents in Python?

The question (fortunately or not) has more than one answer. Moreover, it involves some further questions, and one of them seems to be particularly interesting – what is the best model of XML document structure?

The answer may surprise you: it's a tree. Of course, not the one with wide branches and green leaves, but the one known to mathematicians and computer scientists as a special kind of graph.

Take a look:

cars_for_sale
|
|____ car
|     |_ id: 1
|     |_ brand: Ford
|     |_ model: Mustang
|     |_ production_year: 1972
|     |_ price(USD): 35900
|
|____ car
      |_ id: 2
      |_ brand: Aston Martin
      |_ model: Rapide
      |_ production_year: 2010
      |_ price(GBP): 32000

Does it remind you of something? A directory tree on your computer's disk, for example?

Okay. It definitely looks nice, but what can we learn from it?

There are many possible Python tools allowing you to create, write, read, parse, and modify XML files. Most of them treat an XML document as a tree consisting of objects, while the objects represent elements.

One of these tools is a package named xml.etree and we’re going to show you two simple snippets that make use of it.

Look at the first snippet:

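import xml.etree.ElementTree

# Parse the file and take the root element (<cars_for_sale>)
cars_for_sale = xml.etree.ElementTree.parse('cars.xml').getroot()
print(cars_for_sale.tag)
# Visit every <car> element placed directly inside the root
for car in cars_for_sale.findall('car'):
    print('\t', car.tag)
    # Iterating over an element yields its child elements
    for prop in car:
        print('\t\t', prop.tag, end='')
        if prop.tag == 'price':
            # attrib is a dictionary of the element's attributes
            print(prop.attrib, end='')
        # Finish the line with the element's text
        print(' =', prop.text)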

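Let’s comment on the code:

The parse() function reads the cars.xml file and builds a tree of objects, and the getroot() method returns the object that represents the root element.

Each element object stores its name in the tag property, its text in the text property, and its attributes in the attrib dictionary.

The findall('car') method returns all the car elements nested directly inside the root, and iterating over an element (for prop in car) visits its child elements one by one.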

The code produces the following output:

cars_for_sale
	 car
		 id = 1
		 brand = Ford
		 model = Mustang
		 production_year = 1972
		 price{'currency': 'USD'} = 35900
	 car
		 id = 2
		 brand = Aston Martin
		 model = Rapide
		 production_year = 2010
		 price{'currency': 'GBP'} = 32000
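The first snippet handles exactly two levels of nesting. Because the document is a tree, the same walk can be written recursively so that it works at any depth. The following is only a sketch under our own assumptions: the walk() function is a hypothetical helper, not a part of xml.etree, and the document is a shortened copy of cars.xml kept in a string.

```python
import xml.etree.ElementTree as ET

doc = """<cars_for_sale>
   <car>
      <id>1</id>
      <brand>Ford</brand>
   </car>
</cars_for_sale>"""

def walk(element, depth=0):
    """Return one line per element, indented by its depth in the tree."""
    text = (element.text or '').strip()
    lines = ['   ' * depth + element.tag + (': ' + text if text else '')]
    for child in element:          # iterating an element yields its children
        lines.extend(walk(child, depth + 1))
    return lines

outline = walk(ET.fromstring(doc))
print('\n'.join(outline))
```

The text of container elements such as car consists only of whitespace, which is why the helper strips it before deciding whether to print it.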

The xml.etree.ElementTree module may also be used to create, modify and write XML files.

We’ll use it to remove one car from our offer (the Ford Mustang) and add a new car to it – here is how we did it:

import xml.etree.ElementTree

tree = xml.etree.ElementTree.parse('cars.xml')
cars_for_sale = tree.getroot()
# Find the Ford Mustang and remove it from the offer
for car in cars_for_sale.findall('car'):
    if car.find('brand').text == 'Ford' and car.find('model').text == 'Mustang':
        cars_for_sale.remove(car)
        break
# Build a new <car> element together with its child elements
new_car = xml.etree.ElementTree.Element('car')
xml.etree.ElementTree.SubElement(new_car, 'id').text = '4'
xml.etree.ElementTree.SubElement(new_car, 'brand').text = 'Maserati'
xml.etree.ElementTree.SubElement(new_car, 'model').text = 'Mexico'
xml.etree.ElementTree.SubElement(new_car, 'production_year').text = '1970'
xml.etree.ElementTree.SubElement(new_car, 'price', {'currency': 'EUR'}).text = '61800'
cars_for_sale.append(new_car)
# Save the modified tree to a new file, serialized as XML
tree.write('newcars.xml', method='xml')

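Some explanations will make the sample code clearer – here they go:

The remove() method of the root element deletes the given child element; the loop stops at the first car whose brand and model match.

The Element() constructor creates a new, empty element with the given tag name.

The SubElement() function creates a new element, attaches it to the given parent, and returns it; its optional third argument is a dictionary of attributes.

The append() method adds the new car as the last child of the root, and the write() method of the tree saves the modified document in the newcars.xml file.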
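If you want to experiment without creating any files, the same steps can be tried in memory. The sketch below is a shortened variant under our own assumptions: the document is kept in a string, some of the car properties are dropped for brevity, and the result is printed instead of being written to disk.

```python
import xml.etree.ElementTree as ET

doc = """<cars_for_sale>
   <car><id>1</id><brand>Ford</brand><model>Mustang</model></car>
   <car><id>2</id><brand>Aston Martin</brand><model>Rapide</model></car>
</cars_for_sale>"""

root = ET.fromstring(doc)

# Remove the Ford Mustang, just like the file-based snippet does.
for car in root.findall('car'):
    if car.findtext('brand') == 'Ford' and car.findtext('model') == 'Mustang':
        root.remove(car)
        break

# Append a shortened Maserati record; SubElement attaches it to the root.
new_car = ET.SubElement(root, 'car')
ET.SubElement(new_car, 'id').text = '4'
ET.SubElement(new_car, 'brand').text = 'Maserati'
ET.SubElement(new_car, 'price', {'currency': 'EUR'}).text = '61800'

brands = [car.findtext('brand') for car in root.findall('car')]
print(brands)
print(ET.tostring(root, encoding='unicode'))
```

Here findtext() is used as a shortcut for find() followed by reading the text property.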

As you can see, working with XML doesn't require you to be a rocket scientist, but – if we’re being honest – JSON is more convenient and – last but most important – most currently implemented services use JSON, not XML. You may well encounter a server that communicates by exchanging XML documents, but JSON is much more popular.
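To make the comparison more tangible, here is a rough sketch that turns one car record into JSON. The car_to_dict() function and the layout of the resulting dictionary (in particular, the way the currency attribute is stored) are our own assumptions; XML attributes have no direct counterpart in JSON, so some convention like this has to be chosen.

```python
import json
import xml.etree.ElementTree as ET

doc = """<cars_for_sale>
   <car>
      <id>1</id>
      <brand>Ford</brand>
      <model>Mustang</model>
      <production_year>1972</production_year>
      <price currency="USD">35900</price>
   </car>
</cars_for_sale>"""

def car_to_dict(car):
    """Map one <car> element to a dictionary; the layout is our own choice."""
    price = car.find('price')
    return {
        'id': int(car.findtext('id')),
        'brand': car.findtext('brand'),
        'model': car.findtext('model'),
        'production_year': int(car.findtext('production_year')),
        # JSON has no attributes, so the currency becomes an ordinary key
        'price': {'amount': int(price.text), 'currency': price.get('currency')},
    }

cars = [car_to_dict(car) for car in ET.fromstring(doc).findall('car')]
print(json.dumps(cars, indent=2))
```

Note that the JSON version keeps numbers as numbers, while every value read from XML starts its life as a string.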

We think you can judge for yourself now where this section’s title came from.