Reading Text Files

1. File Types

There is a large variety of data types (e.g. raster, point, polyline, polygon) used in GIS and even more file types (e.g. CSV, TIFF, IMG, ASC, GRID) to store them in. This creates a real problem because software and files are often incompatible with one another. There are two broad categories of file types: text files and binary files. Text files can be opened in a text editor like Notepad or MS-Word and include "PRJ" (projection), "XML" (metadata), KML (Google Earth), ASCII GRID, CSV (comma-separated), and TXT (tab-delimited) files. Text files can also be written and read easily with Python. Binary files require special software and include Shapefile, MXD, Excel, Word, TIFF, JPG, PNG, and IMG files. Binary files can be written and read by Python scripts, but this is much more challenging, so we will typically access these files through an application like ArcGIS.

2. Text Files

The most typical files in scientific research are text files with a specified format. Often this format is columns of entries, with a specific character used to delimit the entries on each line. These files can be opened directly by Excel. Comma-delimited files are the most common and may have a file extension of “csv” for comma-separated values. Tab-delimited files are also common and may have a file extension of “txt” for text or “asc” for ASCII. The extension is only a hint, so the only way to be sure is to open the file in a text editor and check the format. Do not be shy about this, as the files will not hurt your application, although opening a large file can take several minutes.
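To make the idea of a delimiter concrete, here is a minimal sketch that splits one line of each kind into its entries. The two sample lines and the variable names are made up for illustration; they are not taken from any real data file.

```python
# Two made-up sample lines holding the same three entries
CommaLine = "Corvus corax,40.87,-124.08"      # comma-delimited
TabLine = "Corvus corax\t40.87\t-124.08"      # tab-delimited ("\t" is a tab)

CommaEntries = CommaLine.split(",")   # break the line apart at each comma
TabEntries = TabLine.split("\t")      # break the line apart at each tab

print(CommaEntries)
print(TabEntries)
```

Both lines produce the same list of three entries. If you split a line with the wrong delimiter you will usually get the whole line back as a single entry, which is a quick sign that the format is not what you expected.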

There are a large number of text files available on the Internet, but they vary quite a bit in quality. I recommend you spend some time browsing the web for data in your area of interest. There is a good chance you'll find a source of point data in a text file format. One source of these files is the "Avian Knowledge Network" (AKN), which contains point files for bird sightings throughout the world. These files can be very large and may need to be processed to select the data you are interested in. This is where Python comes in.

3. File Paths

You have been working with file paths whenever you “open” and “save” files in applications on a computer. The file system on Windows is very simple: each file path starts with a drive letter, then a colon, then the name of each folder the file is in, starting at the drive and working down to the folder the file is actually in. The path ends in a file name with an extension, and each folder name and the final file name are separated by a backslash.

Below are examples of file paths:

C:/text.txt – a file at the root of the “C” drive
C:/temp/text.txt – a file in the folder named “temp” on the “C” drive

Notice that the "slash" character in my file paths is a "forward slash". You may be more used to seeing the "backslash" ("\") character. The problem with the backslash is that Python already uses it inside strings for special characters like tabs ("\t"). I recommend changing the backslashes to forward slashes, but you can also use two backslashes ("\\") in file paths.
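The short sketch below shows what goes wrong with single backslashes and which spellings are safe. The variable names are just for illustration, and the raw string form (an "r" in front of the opening quote) is an extra option that is not covered above.

```python
BadPath = "C:\temp\text.txt"      # each "\t" is turned into a tab character
SlashPath = "C:/temp/text.txt"    # forward slashes: nothing special happens
DoublePath = "C:\\temp\\text.txt" # "\\" stands for one real backslash
RawPath = r"C:\temp\text.txt"     # raw string: backslashes are left alone

print(repr(BadPath))     # repr() shows the tabs hiding in the broken path
print(DoublePath)
print(RawPath)
```

The doubled backslashes and the raw string give exactly the same path. The first version silently contains two tabs, so Python would look for a file that does not exist.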

The “desktop” is contained inside a folder fairly far down a list of folders. My desktop folder is at:

C:\Documents and Settings\jim\Desktop
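If you are not sure where your own desktop folder is, Python can usually work it out. This is only a sketch: it assumes the desktop is a folder named "Desktop" directly inside your home folder, which is common but not guaranteed on every computer.

```python
import os

HomeFolder = os.path.expanduser("~")               # the current user's home folder
DesktopPath = os.path.join(HomeFolder, "Desktop")  # assumes the usual folder name

print(DesktopPath)
print(os.path.isdir(DesktopPath))  # True only if that folder really exists
```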

Folders are also referred to as directories.

Many of the system files are hidden from you when you use Windows. Using Python you can access these files, overwrite them, and even delete them. This can disable your operating system. You can also delete your own files. Be very careful when accessing files and make sure you are accessing the ones you intend to.
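One cautious habit is to print the full path and confirm it points at a real file before you do anything with it. The sketch below uses a scratch file in the system's temporary folder so it can be run safely; the file name "example_check.txt" is made up for this example.

```python
import os
import tempfile

# Create a small scratch file in the temporary folder (hypothetical name)
ThePath = os.path.join(tempfile.gettempdir(), "example_check.txt")
TheFile = open(ThePath, "w")
TheFile.write("test\n")
TheFile.close()

FullPath = os.path.abspath(ThePath)  # the complete path Python will use
IsFile = os.path.isfile(ThePath)     # True if the path is an existing file

print(FullPath)
print(IsFile)
```

Opening a file with "r" only reads it, so reading cannot damage a file. The risk comes from opening with "w", which erases whatever was in the file before.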

4. Reading Text Files

It is very easy to read a text file in Python and very similar to writing to a text file. The following code will read an entire file into memory:

TheFile=open("C:/Temp/test.txt","r") # open the file for reading (thus the "r")
TheContent=TheFile.read() # read the entire file contents into a variable
print(TheContent) # print the entire file contents
TheFile.close() # close the file
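It is easy to forget the call to close(). As an alternative, Python's "with" statement closes the file for you when the indented block ends. The sketch below first writes a small scratch file (the name "with_example.txt" is made up) so that it runs on any computer, then reads it back.

```python
import os
import tempfile

# Write a small scratch file to read back (hypothetical name)
ThePath = os.path.join(tempfile.gettempdir(), "with_example.txt")
with open(ThePath, "w") as TheFile:
    TheFile.write("first line\nsecond line\n")

# Read the entire file; it is closed automatically after the block
with open(ThePath, "r") as TheFile:
    TheContent = TheFile.read()

print(TheContent)
print(TheFile.closed)  # confirms the file was closed for us
```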

When reading a large text file you will probably want to read it one line at a time. This will save time and memory space. The function "readline()" will return one line at a time from a file and then return an empty string ("") when you have reached the end of the file. The following code will read the first lines from a file, up to 100 lines:

TheFile=open("C:/Temp/test.txt","r") # open the file for reading (thus the "r")

NumLines=0 # start the count of lines read at zero
TheLine=TheFile.readline() # read the first line in the file
while ((TheLine!="") and (NumLines<100)): # loop until the end of the file or the 100 line limit
	TheLine=TheFile.readline() # read the next line in the file
	NumLines+=1 # add one to count the number of lines read

TheFile.close() # close the file

print("Read "+format(NumLines)+" lines from the file")

Try this with several files. Change the maximum number of lines read and see how many lines are actually in each file.
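A common worry is that a blank line in the middle of a file will stop a readline() loop early. It will not, because a blank line still contains its newline character. The sketch below shows this with a small scratch file whose name and contents are made up for the example.

```python
import os
import tempfile

# Write a scratch file whose second line is blank (hypothetical name)
ThePath = os.path.join(tempfile.gettempdir(), "readline_example.txt")
with open(ThePath, "w") as TheFile:
    TheFile.write("first\n\nthird\n")

TheFile = open(ThePath, "r")
Line1 = TheFile.readline()  # the first line, including its newline
Line2 = TheFile.readline()  # the blank line: just a newline, not ""
Line3 = TheFile.readline()  # the third line
Line4 = TheFile.readline()  # nothing left: the empty string
TheFile.close()

print(repr(Line1), repr(Line2), repr(Line3), repr(Line4))
```

Only the last call returns "", so the test against "" in a while loop is safe even when the file has blank lines.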

Note that in the readline() loop above, I limited the number of lines read to 100 to keep my program from looping forever if I made a mistake. For your final turn-in, make sure you change this to a very large number (like 1 million).
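Another way to read a whole file one line at a time, without picking a limit, is to loop over the file itself with "for". This is a sketch of that alternative rather than a replacement for the assignment code; the scratch file name and its 250 made-up lines exist only so the example can run anywhere.

```python
import os
import tempfile

# Write a scratch file with 250 lines (hypothetical name and contents)
ThePath = os.path.join(tempfile.gettempdir(), "forloop_example.txt")
with open(ThePath, "w") as TheFile:
    for Index in range(250):
        TheFile.write("line " + format(Index) + "\n")

NumLines = 0
with open(ThePath, "r") as TheFile:
    for TheLine in TheFile:   # hands back one line at a time until the end
        NumLines += 1         # add one to count the number of lines read

print("Read " + format(NumLines) + " lines from the file")
```

The loop ends by itself at the end of the file, so it cannot run forever, and only one line is held in memory at a time.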

© Copyright 2018 HSU - All rights reserved.