Parsing XML is more complicated than most text files because of the hierarchical structure within the data. Their are several XML parsers available with Python. I recommend starting with the
Sometimes you only want to find certain information within an XML stream. On approach is to just look for the information you are interested in rather than calling one of the parsers is much more work. You have already learned to find strings within a string and then subset the string. You could write your own parser that finds the tag names you are interested in (remember to include the start character "<" but not the end char), and then finds the information within the XML data. This is not recommended for a general purpose solution but can be a quick way to parse files and pull out just the information you need.
As an example, the XML file below contains coordinates in a tag called "coordinates". You could use the functions you've already learned to find the coordinate tags that surround the coordinates, extract the coordinates, and reformat them as desired.
<?xml version="1.0" encoding="UTF-8"?> <kml xmlns="http://earth.google.com/kml/2.1"> <Placemark> <name>Simple placemark</name> <description>Fort Collins</description> <Point> <coordinates>-105,40,0</coordinates> </Point> </Placemark> </kml>
To extract the coordinate from the XML above, we can use the "find()" function and subsetting lists that we have learned before.
Key1="<coordinates>" Key2="</coordinates>" LengthOfKey1=len(Key1) # Find the start of the coordinate tag StartIndex=Test.find(Key1) # End the end of the coordinate tag EndIndex=Test.find(Key2,StartIndex+LengthOfKey1) # Pull the coordinate string from the text Coordinate=Test[StartIndex+LengthOfKey1:EndIndex]
The example above will only work for one coordinate, if there are more, we need to put the code into a loop. One key to this is using "StartIndex" to point to our current location in the file. It is key to set StartIndex to EndIndex plus the length of the end key at the bottom of each loop.
Key1="<coordinates>" Key2="</coordinates>" LengthOfKey1=len(Key1) LengthOfKey2=len(Key2) #Initialize StartIndex so we go into the loop the first time StartIndex=0 while (StartIndex!=-1): # Find the start of the coordinate tag StartIndex=Text.find(Key1,StartIndex) # Only extract the coordinate if one was found if (StartIndex!=-1): # Get the end of the coordinate tag EndIndex=Text.find(Key2,StartIndex+LengthOfKey1) # Extract the coordinate and print it out Coordinate=Text[StartIndex+LengthOfKey1:EndIndex] print(Coordinate) # Move StartIndex to be after the end key StartIndex=EndIndex+LengthOfKey2
You can also write a sophisticated parser that handles complex XML files and contains error checking for poorly formatted XML data. However, Python provides parsers you can use as well.
If you take a look at the Python documentation for XML, you'll see a variety of XML parsers. The examples below use the "Expat" parser because it is fast and does not use validation which will call other servers.
First, know that Expat refers to XML tags as "elements". This is just another name for a tag. To use Expat you provide three "handler functions" which will be called as the XML is being parsed. The code below will parse the XML text indicated and call each of the handler functions in turn. Run the code and you'll see the output as the input data is parsed.
import xml.parsers.expat # Handler functions def StartElement(name, attrs): # Called at the start of a tag print 'Start element:', name, attrs def EndElement(name): # Called at the end of a tag print 'End element:', name def CharData(data): # Called when there is char (text) data inside the tag print 'Character data:', data # Create the parser TheParser = xml.parsers.expat.ParserCreate() # Change the functions in the parser to point to the functions in this file TheParser.StartElementHandler = StartElement TheParser.EndElementHandler = EndElement TheParser.CharacterDataHandler = CharData # Create a sample XML file to test the parser TheText="<?xml version='1.0'?> \ <parent id='top'> \ <child1 name='paul'>Text goes here</child1> \ <child2 name='fred'>\ More text\ </child2> \ </parent>" # Run the sample XML through the parser. Our functions will be called and will print information as the string is parsed. TheParser.Parse(TheText, 1)
Code above was adapted from the samples at: Fast XML Parsing With Expat
The next step would be to have your "Start" and "End" handler functions set a global variable to indicate which tag was currently being parsed. Then you could write out the contents of the tag when needed.
Python Documentation: Structured Language Parsers
Python Documentation: Fast XML Parsing With Expat
© Copyright 2018 HSU - All rights reserved.