
Parsing XML Data

Introduction

Parsing XML is more complicated than parsing most text files because of the hierarchical structure within the data. There are several XML parsers available with Python. I recommend starting with the

Parsing XML Without a Parser

Sometimes you only want to find certain information within an XML stream. One approach is to look for just the information you are interested in, since calling one of the parsers can be much more work. You have already learned to find strings within a string and then subset the string. You could write your own parser that finds the tag names you are interested in (remember to include the start character "<" but not the end character ">", because a start tag may contain attributes after its name), and then extracts the information from the XML data. This is not recommended as a general-purpose solution, but it can be a quick way to parse files and pull out just the information you need.
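
Leaving the ">" off the search key matters when a start tag carries attributes. The following is a minimal sketch of that idea, assuming a made-up XML snippet; the variable names (Sample, TagStart, ContentStart, ContentEnd, Content) are illustrative and not part of the lesson.

```python
# Hypothetical sample: the start tag carries an attribute, so searching
# for "<child1>" (with the ">") would not find it.
Sample = "<parent><child1 name='paul'>Text goes here</child1></parent>"

# Search for the tag name with "<" but without ">"
TagStart = Sample.find("<child1")

# The content begins just after the ">" that closes the start tag
ContentStart = Sample.find(">", TagStart) + 1

# The content ends where the end tag begins
ContentEnd = Sample.find("</child1>", ContentStart)

# Subset the string to get the text inside the tag
Content = Sample[ContentStart:ContentEnd]
print(Content)
```

Note that a key such as "<child1" would also match a longer tag name such as "<child10", which is one reason this approach is only suitable for quick, simple cases.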

As an example, the XML file below contains coordinates in a tag called "coordinates". You could use the functions you've already learned to find the coordinate tags that surround the coordinates, extract the coordinates, and reformat them as desired.

<?xml version="1.0" encoding="UTF-8"?>
   <kml xmlns="http://earth.google.com/kml/2.1">
     <Placemark>
       <name>Simple  placemark</name>
       <description>Fort Collins</description>
       <Point>
          <coordinates>-105,40,0</coordinates>
       </Point>
     </Placemark>
 </kml>

 

To extract the coordinate from the XML above, we can use the "find()" string method and the string subsetting that we have learned before.

Key1="<coordinates>"
Key2="</coordinates>"

LengthOfKey1=len(Key1)

# Text is assumed to hold the XML above as a string

# Find the start of the coordinate tag
StartIndex=Text.find(Key1)

# Find the end of the coordinate tag
EndIndex=Text.find(Key2,StartIndex+LengthOfKey1)

# Pull the coordinate string from the text
Coordinate=Text[StartIndex+LengthOfKey1:EndIndex]
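
The extracted value is still a single string. As a sketch of the reformatting step, assuming the KML convention that coordinates are listed as longitude, latitude, altitude, the string can be split into numbers. The variable names below are illustrative only.

```python
# The string as it would be extracted from the sample file
Coordinate = "-105,40,0"

# Split on commas and convert each piece to a number
Parts = [float(Value) for Value in Coordinate.split(",")]
Longitude, Latitude, Altitude = Parts

# Reformat as desired, for example with latitude first
print("Latitude: {}, Longitude: {}".format(Latitude, Longitude))
```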

 

The example above will only work for one coordinate. If there are more, we need to put the code into a loop. One key to this is using "StartIndex" to point to our current location in the file. At the bottom of each pass through the loop, StartIndex must be set to EndIndex plus the length of the end key so that the next search starts after the tag that was just processed.

Key1="<coordinates>"
Key2="</coordinates>"
LengthOfKey1=len(Key1)
LengthOfKey2=len(Key2)

#Initialize StartIndex so we go into the loop the first time
StartIndex=0
while (StartIndex!=-1):
    # Find the start of the coordinate tag
    StartIndex=Text.find(Key1,StartIndex)
    
    # Only extract the coordinate if one was found
    if (StartIndex!=-1):
        # Get the end of the coordinate tag
        EndIndex=Text.find(Key2,StartIndex+LengthOfKey1)
        
        # Extract the coordinate and print it out
        Coordinate=Text[StartIndex+LengthOfKey1:EndIndex]
        print(Coordinate)
        
        # Move StartIndex to be after the end key
        StartIndex=EndIndex+LengthOfKey2
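
The loop above assumes every start tag has a matching end tag. If an end tag is missing, "find()" returns -1 for EndIndex and the loop can misbehave (for example, by repeating forever). One possible way to guard against that, sketched here as a hypothetical helper function (ExtractAll and the Sample string are not part of the lesson), is to stop as soon as either search fails and to return the results as a list.

```python
def ExtractAll(Text, Key1, Key2):
    """Return a list of the strings found between each Key1 and Key2."""
    Results = []
    StartIndex = Text.find(Key1)
    while StartIndex != -1:
        # The content starts right after the start key
        ContentStart = StartIndex + len(Key1)
        EndIndex = Text.find(Key2, ContentStart)
        if EndIndex == -1:
            break  # Missing end key: stop instead of looping forever
        Results.append(Text[ContentStart:EndIndex])
        # Continue searching after the end key
        StartIndex = Text.find(Key1, EndIndex + len(Key2))
    return Results

# Made-up test data: the third coordinate tag is never closed
Sample = ("<Point><coordinates>-105,40,0</coordinates></Point>"
          "<Point><coordinates>-124,41,0</coordinates></Point>"
          "<Point><coordinates>-100,35,0")

print(ExtractAll(Sample, "<coordinates>", "</coordinates>"))
```

Returning a list rather than printing inside the loop lets the caller decide how to reformat or store the values.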

You can also write a sophisticated parser that handles complex XML files and contains error checking for poorly formatted XML data. However, Python provides parsers you can use as well.

Parsing XML With a Parser (all of my previous classes preferred the method above to this one)

If you take a look at the Python documentation for XML, you'll see a variety of XML parsers. The examples below use the "Expat" parser because it is fast and does not perform validation, which can require calls to other servers.

First, know that Expat refers to XML tags as "elements". This is just another name for a tag. To use Expat you provide three "handler functions" which will be called as the XML is being parsed. The code below will parse the XML text indicated and call each of the handler functions in turn. Run the code and you'll see the output as the input data is parsed.

import xml.parsers.expat

# Handler functions

def StartElement(name, attrs): # Called at the start of a tag
    print('Start element:', name, attrs)
def EndElement(name): # Called at the end of a tag
    print('End element:', name)
def CharData(data): # Called when there is char (text) data inside the tag
    print('Character data:', data)


# Create the parser
TheParser = xml.parsers.expat.ParserCreate()


# Change the functions in the parser to point to the functions in this file
TheParser.StartElementHandler = StartElement
TheParser.EndElementHandler = EndElement
TheParser.CharacterDataHandler = CharData


# Create a sample XML file to test the parser
TheText="<?xml version='1.0'?>  \
<parent id='top'> \
	<child1 name='paul'>Text goes here</child1>  \
	<child2 name='fred'>\
		More text\
	</child2>  \
</parent>"


# Run the sample XML through the parser.  Our functions will be called and will print information as the string is parsed.
TheParser.Parse(TheText, 1)

Code above was adapted from the samples at: Fast XML Parsing With Expat

The next step would be to have your "Start" and "End" handler functions set a global variable to indicate which tag is currently being parsed. Then your character data handler could write out the contents of the tag when needed.
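
One way that next step might look is sketched below. It assumes a small made-up KML-like string and collects the text of every "coordinates" tag. The names CurrentTag, Pieces, and Coordinates are illustrative choices, not part of Expat.

```python
import xml.parsers.expat

CurrentTag = None   # Name of the tag currently being parsed
Pieces = []         # Character data collected for the current tag
Coordinates = []    # Finished coordinate strings

def StartElement(name, attrs): # Called at the start of a tag
    global CurrentTag
    CurrentTag = name
    Pieces.clear()

def EndElement(name): # Called at the end of a tag
    global CurrentTag
    if name == "coordinates":
        Coordinates.append("".join(Pieces))
    CurrentTag = None

def CharData(data): # Called when there is text data inside the tag
    # Expat may deliver the text of one tag in several calls,
    # so the pieces are collected and joined at the end of the tag
    if CurrentTag == "coordinates":
        Pieces.append(data)

# Create the parser and point it at the handler functions
TheParser = xml.parsers.expat.ParserCreate()
TheParser.StartElementHandler = StartElement
TheParser.EndElementHandler = EndElement
TheParser.CharacterDataHandler = CharData

# Made-up sample data with two placemarks
TheText = ("<kml>"
           "<Placemark><Point><coordinates>-105,40,0</coordinates></Point></Placemark>"
           "<Placemark><Point><coordinates>-124,41,0</coordinates></Point></Placemark>"
           "</kml>")

TheParser.Parse(TheText, True)
print(Coordinates)
```

This sketch only tracks the innermost tag, which is enough when the tag of interest contains text and no child tags.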

Additional Resources

Python Documentation: Structured Language Parsers

Python Documentation: Fast XML Parsing With Expat

© Copyright 2018 HSU - All rights reserved.