How to Read Data Files Line by Line in Python

Have you ever needed to read .csv or .dat files line by line for data analysis? In this article, we will walk you through the process of reading files line by line using Python.

Tested Environment

macOS Catalina (10.15.7), Python 3.7.6, Atom Editor 1.44.0

Steps to Read a File

The steps to read a file are as follows:

  1. Prepare the file.
  2. Write code to read the file.
  3. Execute the program.

While the basic process remains the same for different file types like dat or csv, the code you write may vary slightly. Let’s start by looking at how to read data from a dat file.

Reading .dat Files Line by Line

Preparing the .dat File

We have prepared a data file named “average_temperature_kyoto_2018.dat” containing the average monthly temperatures for Kyoto city in 2018.

# averaged temperature in 2018 @ Kyoto city
# 01: month 02: averaged temperature in the daytime
1 3.9
2 4.4
3 10.9
4 16.4
5 20.0
6 23.4
7 29.8
8 29.5
9 23.6
10 18.7
11 13.5
12 8.2

The first column represents the month, and the second column represents the average daytime temperature. Columns are separated by a single space. Lines starting with # are comments. The .dat extension is a generic one for data files; this particular file is plain text and can be opened in any text editor.

Save this “average_temperature_kyoto_2018.dat” file in the directory “Desktop/LabCode/python/data-analysis/inputfile_eachrow”.

Code for Reading .dat File Line by Line

Save a file named “inputfile.py” in the same directory as “average_temperature_kyoto_2018.dat”. To cut to the chase, here’s the code for reading the file:

input_data = open('average_temperature_kyoto_2018.dat', 'r')

# Read and display one line at a time
for row in input_data:
    # Skip commented lines
    if row[0] == '#':
        continue
    
    # Store values in variables
    columns = row.rstrip('\n').split(' ')
    month = columns[0]
    ave_temperature = columns[1]
    
    # Display contents of each line
    print(month, ave_temperature)

# Close the file
input_data.close()

Running the Program

Let’s execute the above program. Open your terminal, navigate to “Desktop/LabCode/python/data-analysis/inputfile_eachrow”, and run the following command:

python inputfile.py

# (Output)
# 1 3.9
# 2 4.4
# 3 10.9
# 4 16.4
# 5 20.0
# 6 23.4
# 7 29.8
# 8 29.5
# 9 23.6
# 10 18.7
# 11 13.5
# 12 8.2

If all goes well, you should see the above output. The code was shown without much explanation, so let’s go through what each line means.

Code Explanation

input_data = open('average_temperature_kyoto_2018.dat', 'r')

open(filename) opens a file and returns a file object. Writing input_data = open(filename) stores that file object, not the file's contents, in the variable input_data; the contents are read only when you iterate over it or call one of its read methods. The 'r' stands for read mode. Since open() defaults to read mode, you can omit the 'r' and simply write open('average_temperature_kyoto_2018.dat').
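
To make the point about open() concrete, here is a minimal sketch showing that it hands back a file object and that read mode is the default. The temporary file and the names path and f are assumptions made only so the snippet runs on its own; they are not part of the original program.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
# (a hypothetical stand-in for average_temperature_kyoto_2018.dat).
with tempfile.NamedTemporaryFile('w', suffix='.dat', delete=False) as tmp:
    tmp.write('# comment\n1 3.9\n2 4.4\n')
    path = tmp.name

f = open(path)               # same as open(path, 'r')
mode = f.mode                # the mode the file was opened in
first_line = f.readline()    # nothing is read until we ask for it
f.close()
os.remove(path)

print(mode)                  # r
print(repr(first_line))      # '# comment\n'
```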

for row in input_data:

for xxx in ooo: is the syntax for a for loop. The name row used here is just a variable name, and you can use any name you prefer.

The processing steps of this loop are as follows:

  1. Take one line of data from ooo and store it in xxx.
  2. Execute the operations listed below the for statement.

This process (1 → 2) is repeated, and once all the data in ooo has been processed (up to the last line), the loop concludes.
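
The two steps above can be sketched as follows. io.StringIO is used here as a stand-in for an opened file (an assumption for this sketch), and the list seen is a hypothetical helper that records what the loop receives on each pass.

```python
import io

# io.StringIO stands in for an opened file (an assumption for this sketch);
# iterating over it yields one line at a time, just like a real file object.
input_data = io.StringIO('1 3.9\n2 4.4\n3 10.9\n')

seen = []
for row in input_data:   # step 1: the next line is assigned to row
    seen.append(row)     # step 2: the loop body runs for that line

print(seen)  # ['1 3.9\n', '2 4.4\n', '3 10.9\n']
```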

if row[0] == '#':

Here row holds one line of the file as a string, and row[0] is the first character of that string. For instance, if the string “Hello” were stored in row, then row[0] would be “H”, row[1] would be “e”, row[2] would be “l”, and so on.

Hence, this line means that if the first character of the line is #, the indented statement below it is executed. That statement is continue, which skips the rest of the loop body and moves on to the next line, so comment lines are never printed.
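
As a small illustration of string indexing and the comment check, consider the sketch below. The helper is_comment is a hypothetical name introduced only for this example.

```python
def is_comment(line):
    """Return True if the line's first character is '#'."""
    return line[0] == '#'   # line[:1] would also tolerate an empty string

word = 'Hello'
print(word[0], word[1], word[2])   # H e l
print(is_comment('# header\n'))    # True
print(is_comment('1 3.9\n'))       # False
```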

columns = row.rstrip('\n').split(' ')

In this line of code, row.rstrip('\n') removes the trailing '\n' (newline) from the string stored in row. If you were to execute print(repr(row)), you would see something like '1 3.9\n'. The '\n' is the newline character right after the '3.9'; a plain print(row) shows it only as an extra blank line. rstrip('\n') strips this newline character.

Following this, .split(' ') splits the string into substrings wherever there is a space (' ') character. The resulting substrings are stored in a list, here named columns.

As a result, after this line of code, columns looks like ['1', '3.9']. You can access the individual elements with columns[0] and columns[1]. Note that both elements are still strings.

So, for a line like “1 3.9”, this statement removes the newline and breaks the line into separate parts, making it easier to work with.
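
The following sketch walks through the same steps on a single hard-coded line. The conversion to int and float and the argument-free split() at the end are additions for illustration, not part of the original program.

```python
row = '1 3.9\n'

print(repr(row))                  # '1 3.9\n'
stripped = row.rstrip('\n')
print(repr(stripped))             # '1 3.9'
columns = stripped.split(' ')
print(columns)                    # ['1', '3.9']

# The pieces are still strings; convert them before doing arithmetic.
month = int(columns[0])
ave_temperature = float(columns[1])
print(month, ave_temperature)     # 1 3.9

# split() with no argument splits on any run of whitespace,
# which helps if a file uses several spaces or tabs between columns.
print('1   3.9\n'.split())        # ['1', '3.9']
```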

[Advanced] Excluding Commented Lines

In the previous explanation, we used if row[0] == '#': continue to exclude commented lines. Alternatively, you can use the re.match function from Python’s standard re (regular expression) module, which lets you drop the continue statement (and show off your Python skills). Note that the code below also renames the variables: rows now holds the line, and row holds the list of split values. With re.match, the entire code looks like this:

import re

input_data = open('average_temperature_kyoto_2018.dat', 'r')

for rows in input_data:
    if not re.match('#', rows):
        # Separation
        row = rows.rstrip('\n').split(' ')
        month = row[0]
        ave_temperature = row[1]
        
        # Display contents of each line
        print(month, ave_temperature)

input_data.close()

MEMO
When using re.match, be sure to include import re at the beginning of your script.

if not re.match('#', rows):

This line means: If the line doesn’t start with a ‘#’ character, execute the following operations. As a result, lines that begin with ‘#’ will be excluded from processing.

With re.match, the skip condition and the processing sit in a single if block, so no continue statement is needed.
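
To show what re.match actually checks, here is a short sketch. It also compares the result with str.startswith, a plain string method that is not used in the original article but performs the same test without a regular expression. The sample lines are made up for illustration.

```python
import re

lines = ['# header\n', '1 3.9\n', '2 # not a comment\n']

# re.match only matches at the beginning of the string,
# so a '#' later in the line does not count.
kept_with_re = [line for line in lines if not re.match('#', line)]

# str.startswith performs the same check without a regular expression.
kept_with_startswith = [line for line in lines if not line.startswith('#')]

print(kept_with_re == kept_with_startswith)  # True
print(kept_with_re)  # ['1 3.9\n', '2 # not a comment\n']
```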

Reading CSV Files Line by Line

If you’re familiar with what a CSV file is, you can easily modify the code mentioned earlier to read CSV files. CSV stands for Comma-Separated Values, representing a type of file where values are separated by commas. You just need to adjust the delimiter from space to comma.

Preparing the CSV File

Let’s prepare a file named “average_temperature_kyoto_2018.csv”. Just like before, place it in “Desktop/LabCode/python/data-analysis/inputfile_eachrow”. Here’s what the content looks like:

# averaged temperature in 2018 @ Kyoto city
# 01: month  02: averaged temperature in the daytime
1,3.9
2,4.4
3,10.9
4,16.4
5,20.0
6,23.4
7,29.8
8,29.5
9,23.6
10,18.7
11,13.5
12,8.2

Code for Reading CSV Files Line by Line

Simply replace rows.rstrip('\n').split(' ') with rows.rstrip('\n').split(',')! The entire code will look like this, and the output will be the same as when reading the dat file:

import re

input_data = open('average_temperature_kyoto_2018.csv', 'r')

for rows in input_data:
    if not re.match('#', rows):
        row = rows.rstrip('\n').split(',')
        month = row[0]
        ave_temperature = row[1]
        print(month, ave_temperature)

input_data.close()
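
As an alternative to splitting on commas by hand, Python's standard csv module can do the splitting. The sketch below is one possible approach, not the method used in this article: io.StringIO stands in for the opened file, and the names data_lines and records are hypothetical. Note that csv.reader does not skip comment lines on its own, so they are filtered out first.

```python
import csv
import io

# io.StringIO stands in for the opened CSV file (an assumption for this sketch).
input_data = io.StringIO('# header\n1,3.9\n2,4.4\n')

# Drop comment lines first, then let csv.reader split the rest on commas.
data_lines = (line for line in input_data if not line.startswith('#'))
records = [(int(month), float(temp)) for month, temp in csv.reader(data_lines)]

print(records)  # [(1, 3.9), (2, 4.4)]
```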

[A Little Advanced] Using the ‘with’ Statement for Reading Files Line by Line

When using the with statement (with ~ as ~:), there’s no need to explicitly use close(), making the code a bit more concise. Here’s an example for reading a CSV file:

import re

input_file = 'average_temperature_kyoto_2018.csv'

with open(input_file) as input_data:
    # Read and display one line at a time
    for rows in input_data:
        # Skip commented lines
        if not re.match('#', rows):
            # Separation
            row = rows.rstrip('\n').split(',')
            month = row[0]
            ave_temperature = row[1]
            
            # Display contents of each line
            print(month, ave_temperature)

with open(input_file) as input_data:

Writing this opens input_file and assigns the resulting file object to input_data. (You can use any name you prefer for input_data.) You can then read and process each line with a for loop, just like before. When the with block ends, the file is closed automatically, even if an error occurs inside the block.
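
A quick way to see what the with statement does is to check the file object's closed attribute inside and after the block. In this sketch the temporary file and the names inside and after are assumptions made so the snippet runs on its own.

```python
import os
import tempfile

# Throwaway file so the sketch is self-contained (a hypothetical stand-in).
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('1,3.9\n2,4.4\n')
    input_file = tmp.name

with open(input_file) as input_data:
    inside = input_data.closed    # False: still open inside the block
    rows = [line.rstrip('\n').split(',') for line in input_data]

after = input_data.closed         # True: closed automatically on exit
os.remove(input_file)

print(inside, after)  # False True
print(rows)           # [['1', '3.9'], ['2', '4.4']]
```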

Once you become more comfortable with Python, give the with statement a try!