
Subsetting Strings Based on Content

Another common string parsing task is to find strings within a string, often called "keys" or "tags", and then subset the string based on where the keys or tags are located.

1. Subsetting Example with KML

KML, or "Keyhole Markup Language", is the format that is used to send spatial data to Google Earth. The coordinate values in a KML string are surrounded by "tags" named "latitude" and "longitude". The Python string method "find()" returns the index of the first instance of a string within another string, or -1 if the string is not found.
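As a quick illustration of how "find()" behaves, here is a minimal sketch. The sample text and the variable names other than TheText are made up for this example and are not taken from a real KML file.

```python
# Hypothetical sample text, used only for illustration
TheText = "<latitude>40.5</latitude><longitude>-105.1</longitude>"

# find() returns the index of the first character of the first match
LatitudeIndex = TheText.find("<latitude>")
print(LatitudeIndex)  # 0, because the tag is at the very start of the string

# find() returns -1 when the string is not present
MissingIndex = TheText.find("<altitude>")
print(MissingIndex)  # -1
```

Because -1 is also a valid index in Python (it refers to the last character), it is worth checking for -1 before using the result to subset a string.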

Figure: Diagram of how subsetting a string works, with StartIndex+Length at the start of the portion to extract and EndIndex at the end of it.

This approach uses find() on TheText to set StartIndex to point to the first character of the string <latitude>. Then, Length is set to the length of <latitude>. Adding StartIndex and Length gives us the index of the start of the latitude value. Then, EndIndex is set to the start of the string </latitude>. These values are then used to extract the value of the latitude from the text with TheText[StartIndex+Length:EndIndex].
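The steps described here can be sketched as follows. This is only one way to write it: the KML fragment and its coordinate values are made-up sample data, and StartTag, EndTag, and TheLatitude are names chosen for this example.

```python
# Hypothetical KML fragment; the values are sample data for illustration
TheText = "<Camera><longitude>-124.08</longitude><latitude>40.87</latitude></Camera>"

StartTag = "<latitude>"
EndTag = "</latitude>"

StartIndex = TheText.find(StartTag)  # index of the "<" in <latitude>
Length = len(StartTag)               # number of characters in the opening tag
EndIndex = TheText.find(EndTag)      # index of the "<" in </latitude>

# Only subset the string if both tags were found (find() returns -1 otherwise)
if StartIndex != -1 and EndIndex != -1:
    TheLatitude = TheText[StartIndex + Length:EndIndex]
else:
    TheLatitude = None

print(TheLatitude)  # 40.87
```

Note that the result is still a string. If you need a number, you can convert it with float(TheLatitude).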

2. Subsetting DMS Coordinates

Let's say we have a string that contains a coordinate value in degrees, minutes, and seconds (DMS), but we want to pull out just the degrees from the string. We can use the "find()" function to find an index into the string and then use slicing, the same syntax that is used for indexing lists, to pull out just the degree portion. We'll need to use the escape sequence "\xB0" for the degree symbol. The degree symbol is not part of 7-bit ASCII; if you check a Latin-1 (ISO 8859-1) or Unicode chart you'll see that the degree symbol is at hexadecimal "B0" on the chart.

Try the code below and then add another "find()" to get the index of the single quote that follows the minutes in the coordinate, and see if you can pull the minutes out.

TheCoordinate="40\xB0 21' 32\" E, 105\xB0 30' 40\""
print("Coordinate="+TheCoordinate) # print the coordinate so we can see it before we subset it
EndOfDegree=TheCoordinate.find("\xB0") # index of the first degree symbol
TheDegree=TheCoordinate[:EndOfDegree] # everything before the degree symbol
print(TheDegree)
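One possible way to approach the exercise is sketched below. It assumes the optional second argument of "find()", which tells Python where to start searching; EndOfMinute and TheMinute are names chosen for this example.

```python
TheCoordinate = "40\xB0 21' 32\" E, 105\xB0 30' 40\""

# Find the first degree symbol, as before
EndOfDegree = TheCoordinate.find("\xB0")

# Start searching for the single quote at the degree symbol
EndOfMinute = TheCoordinate.find("'", EndOfDegree)

# The minutes sit between the degree symbol and the single quote
TheMinute = TheCoordinate[EndOfDegree + 1:EndOfMinute]
print(repr(TheMinute))  # repr() makes the leading space visible
```

The extracted text still has a space in front of it, which is why removing white space is usually the next step.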

Note: the syntax "TheList[:EndIndex]" will create a new list containing the contents of "TheList" up to, but not including, TheList[EndIndex]. Similarly, you can use "TheList[StartIndex:]" to get a new list containing the contents of "TheList" from TheList[StartIndex] to the end of the list. The syntax "TheList[StartIndex:EndIndex]" will create a new list containing the entries from TheList[StartIndex] up to, but not including, TheList[EndIndex]. The same syntax works on strings, where it creates a new string instead of a new list.
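The three forms in the note can be tried on a small list and on a string. The values below are arbitrary sample data chosen for this sketch.

```python
TheList = ["a", "b", "c", "d", "e"]

FirstTwo = TheList[:2]    # entries 0 and 1
FromTwo = TheList[2:]     # entry 2 through the end of the list
Middle = TheList[1:3]     # entries 1 and 2 (entry 3 is not included)
print(FirstTwo)  # ['a', 'b']
print(FromTwo)   # ['c', 'd', 'e']
print(Middle)    # ['b', 'c']

# The same syntax on a string creates a new string
TheString = "40\xB0 21' 32\""
TheMinutes = TheString[4:6]
print(TheMinutes)  # 21
```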

After you pull out the elements of a string you may find that there is "white space" on one side or the other of the string. White space includes spaces, tabs, carriage returns, and new line characters. It's a good idea to use Python's string method "strip()" to remove any unwanted white space from both ends of the string. Add the code below to the code you entered from above.

TheDegree=TheDegree.strip()
print(TheDegree) 
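To see what "strip()" removes, here is a small sketch with a made-up value that has white space on both sides. It also shows the related methods "lstrip()" and "rstrip()", which only remove white space from the left or right side.

```python
# Hypothetical value with a space and tab in front and a carriage return and new line after
TheValue = " \t21\r\n"

Stripped = TheValue.strip()        # removes white space from both ends
LeftStripped = TheValue.lstrip()   # removes white space from the start only
RightStripped = TheValue.rstrip()  # removes white space from the end only

# repr() shows the white space characters that print() would hide
print(repr(Stripped))       # '21'
print(repr(LeftStripped))   # '21\r\n'
print(repr(RightStripped))  # ' \t21'
```

None of these methods remove white space from the middle of a string, and they all return a new string rather than changing the original.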

 

 

Additional Resources

Python Documentation: String functions

 

© Copyright 2018 HSU - All rights reserved.