Have you ever needed to read .csv or .dat files line by line for data analysis? In this article, we will walk you through the process of reading files line by line using Python.
Tested environment: macOS Catalina (10.15.7), Python 3.7.6, Atom Editor 1.44.0
- Steps to Read a File
- Reading .dat Files Line by Line
- Reading CSV Files Line by Line
- [A Little Advanced] Using the ‘with’ Statement for Reading Files Line by Line
Steps to Read a File
The steps to read a file are as follows:
- Prepare the file.
- Write code to read the file.
- Execute the program.
While the basic process remains the same for different file types like dat or csv, the code you write may vary slightly. Let’s start by looking at how to read data from a dat file.
Reading .dat Files Line by Line
Preparing the .dat File
We have prepared a data file named “average_temperature_kyoto_2018.dat” containing the average monthly temperatures for Kyoto city in 2018.
    # averaged temperature in 2018 @ Kyoto city
    # 01: month 02: averaged temperature in the daytime
    1 3.9
    2 4.4
    3 10.9
    4 16.4
    5 20.0
    6 23.4
    7 29.8
    8 29.5
    9 23.6
    10 18.7
    11 13.5
    12 8.2
The first column represents the month, and the second column represents the average daytime temperature. Columns are separated by a space (half-width space). Lines starting with # are comments. The .dat extension indicates a data file, similar to a plain text file.
Save this “average_temperature_kyoto_2018.dat” file in the directory “Desktop/LabCode/python/data-analysis/inputfile_eachrow”.
Code for Reading .dat File Line by Line
Save a file named “inputfile.py” in the same directory as “average_temperature_kyoto_2018.dat”. To cut to the chase, here’s the code for reading the file:
    input_data = open('average_temperature_kyoto_2018.dat', 'r')

    # Read and display one line at a time
    for row in input_data:
        # Skip commented lines (lines whose first character is '#')
        if row[0] == '#':
            continue

        # Store values in variables
        columns = row.rstrip('\n').split(' ')
        month = columns[0]
        ave_temperature = columns[1]

        # Display contents of each line
        print(month, ave_temperature)

    # Close the file
    input_data.close()
Running the Program
Let’s execute the above program. Open your terminal, navigate to “Desktop/LabCode/python/data-analysis/inputfile_eachrow”, and run the following command.
    python inputfile.py
    # (Output)
    # 1 3.9
    # 2 4.4
    # 3 10.9
    # 4 16.4
    # 5 20.0
    # 6 23.4
    # 7 29.8
    # 8 29.5
    # 9 23.6
    # 10 18.7
    # 11 13.5
    # 12 8.2
If all goes well, you should see the above output. Since there wasn’t much explanation provided, let’s delve into the meaning of each line next.
input_data = open('average_temperature_kyoto_2018.dat', 'r')
You open a file using open(filename). By writing input_data = open(filename), you assign the file object returned by open() to the variable input_data. The contents of the file are read later, as you iterate over that object. The 'r' stands for read mode. Since open() defaults to read mode, you can omit the 'r' in this case and simply write open('average_temperature_kyoto_2018.dat') without any issues.
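To check the point about the default mode, here is a minimal sketch. It creates a small throwaway file first so that it runs on its own; the file name sample.dat and the variable names are assumptions made for this illustration, not part of the article's setup.

```python
import os
import tempfile

# Hypothetical sample file, created only so this sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.dat")
sample = open(path, "w")      # 'w' is write mode
sample.write("1 3.9\n2 4.4\n")
sample.close()

explicit = open(path, "r")    # read mode stated explicitly
implicit = open(path)         # mode omitted, so the default is used

# Both file objects report the same mode.
modes = (explicit.mode, implicit.mode)
first_line = implicit.readline()

explicit.close()
implicit.close()

print(modes)             # ('r', 'r')
print(repr(first_line))  # '1 3.9\n'
```

Both calls return a file object in read mode, so leaving out 'r' only makes the line shorter.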
for row in input_data:
for xxx in ooo: is the syntax for a for loop. Here, row is just a variable name, and you can use any variable name you prefer.
The processing steps of this loop are as follows:
- Take one line of data from ooo and store it in xxx.
- Execute the indented operations listed below the for line.
This process (1 → 2) is repeated, and once all the data in ooo has been processed (up to the last line), the loop concludes.
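The two steps of the loop can be observed directly with a small sketch. The throwaway file and the names path, seen_rows, and line_count are assumptions for this illustration; repr() is used so the newline at the end of each line is visible.

```python
import os
import tempfile

# Hypothetical three-line file, created only so this sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.dat")
sample = open(path, "w")
sample.write("# comment\n1 3.9\n2 4.4\n")
sample.close()

input_data = open(path)
seen_rows = []
line_count = 0

for row in input_data:
    # Step 1: the next line of the file has been stored in row.
    # Step 2: the indented statements run once for that line.
    line_count += 1
    seen_rows.append(row)
    print(line_count, repr(row))

# The loop ends by itself after the last line.
input_data.close()
```

Each pass through the loop receives exactly one line, including its trailing newline character.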
if row[0] == '#':
In this code snippet, row contains the text of one line as a string. row[0] refers to the first character of that string. For instance, if the string “Hello” were stored in row, row[0] would represent “H”, row[1] would represent “e”, row[2] would represent “l”, and so on.
Hence, the meaning of this line is that if row[0] is #, that is, if the line begins with #, then continue is executed: the rest of the loop body is skipped and the loop moves on to the next line.
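A short sketch of string indexing follows. It also shows str.startswith, which is a swapped-in alternative to the index check and not what the article's code uses; the variable names are assumptions for this illustration.

```python
word = "Hello"
first_three = (word[0], word[1], word[2])
print(first_three)  # ('H', 'e', 'l')

comment_row = "# averaged temperature in 2018 @ Kyoto city\n"
data_row = "1 3.9\n"

# Index check, as in the article's code.
by_index = (comment_row[0] == "#", data_row[0] == "#")

# startswith performs the same test without indexing.
by_method = (comment_row.startswith("#"), data_row.startswith("#"))

# On an empty string, indexing raises IndexError,
# while startswith simply returns False.
empty = ""
empty_is_comment = empty.startswith("#")
try:
    empty[0]
    index_failed = False
except IndexError:
    index_failed = True

print(by_index, by_method, empty_is_comment, index_failed)
```

When iterating over a file, every line contains at least a newline character, so the index check works there; startswith is mainly a safeguard if the same test is reused on other strings.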
columns = row.rstrip('\n').split(' ')
In this line of code, row.rstrip('\n') removes the trailing '\n' (newline) from the string stored in row. For example, row might hold “1 3.9\n”, where '\n' is the newline character right after “3.9”. (print(repr(row)) makes it visible.) The rstrip('\n') method is used to remove this newline character.
.split(' ') is used to split the string into substrings wherever there is a space (' ') character. The resulting substrings are stored in a list, in this case named columns.
As a result, after this line of code, columns might look something like ['1', '3.9']. You can access the individual elements of this list using columns[0] and columns[1].
So, when processing a line like “1 3.9\n”, this line of code removes the newline character and breaks the rest into separate parts, making it easier to work with.
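The effect of each method can be checked step by step. This sketch also converts the resulting strings to numbers and shows split() without an argument; both are additions for illustration and go beyond what the article's code does.

```python
row = "1 3.9\n"

stripped = row.rstrip("\n")    # '1 3.9'
columns = stripped.split(" ")  # ['1', '3.9']

# The elements are still strings; convert them before doing arithmetic.
month = int(columns[0])
ave_temperature = float(columns[1])

# With two spaces between the values, split(' ') yields an empty string,
# while split() with no argument splits on any run of whitespace.
wide_row = "10  18.7\n"
split_on_space = wide_row.rstrip("\n").split(" ")  # ['10', '', '18.7']
split_on_any = wide_row.split()                    # ['10', '18.7']

print(columns, month, ave_temperature)
print(split_on_space, split_on_any)
```

If a data file may contain tabs or repeated spaces between columns, split() with no argument is usually the more forgiving choice.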
[Advanced] Excluding Commented Lines
In the previous explanation, we used if row[0] == '#': continue to exclude commented lines. However, by using the re.match function from Python’s standard re (regular expression) module, you can write slightly more streamlined code (and show off your Python skills). Note that in this version each line is stored in rows and the list of split values in row. The entire code would look like the following:
    import re

    input_data = open('average_temperature_kyoto_2018.dat', 'r')

    for rows in input_data:
        # Process only lines that do not start with '#'
        if not re.match('#', rows):
            # Separation
            row = rows.rstrip('\n').split(' ')
            month = row[0]
            ave_temperature = row[1]

            # Display contents of each line
            print(month, ave_temperature)

    input_data.close()
To use re.match, write import re at the beginning of your script.
if not re.match('#', rows):
This line means: If the line doesn’t start with a ‘#’ character, execute the following operations. As a result, lines that begin with ‘#’ will be excluded from processing.
By using re.match, you can achieve the same outcome in a more compact and elegant manner.
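re.match only tries to match at the very beginning of the string, which is why the pattern '#' is enough to detect lines that start with #. The sketch below illustrates this; the indented comment line and the pattern r'\s*#' are assumptions added for illustration and do not appear in the article's data file.

```python
import re

lines = [
    "# averaged temperature in 2018 @ Kyoto city\n",
    "1 3.9\n",
    "  # indented comment\n",  # hypothetical line, not in the article's file
]

# re.match anchors at the start of the string, so only the first line matches.
starts_with_hash = [bool(re.match("#", line)) for line in lines]

# Allowing optional leading whitespace also catches the indented comment.
allows_indent = [bool(re.match(r"\s*#", line)) for line in lines]

# str.startswith gives the same result as the plain '#' pattern.
by_startswith = [line.startswith("#") for line in lines]

print(starts_with_hash)  # [True, False, False]
print(allows_indent)     # [True, False, True]
```

For a fixed prefix such as '#', startswith and re.match behave the same; a regular expression becomes useful once the rule for a comment line gets more flexible.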
Reading CSV Files Line by Line
If you’re familiar with what a CSV file is, you can easily modify the code mentioned earlier to read CSV files. CSV stands for Comma-Separated Values, representing a type of file where values are separated by commas. You just need to adjust the delimiter from space to comma.
Preparing the CSV File
Let’s prepare a file named “average_temperature_kyoto_2018.csv”. Just like before, place it in “Desktop/LabCode/python/data-analysis/inputfile_eachrow”. Here’s what the content looks like:
    # averaged temperature in 2018 @ Kyoto city
    # 01: month 02: averaged temperature in the daytime
    1,3.9
    2,4.4
    3,10.9
    4,16.4
    5,20.0
    6,23.4
    7,29.8
    8,29.5
    9,23.6
    10,18.7
    11,13.5
    12,8.2
Code for Reading CSV Files Line by Line
Simply replace rows.rstrip('\n').split(' ') with rows.rstrip('\n').split(',')! The entire code will look like this, and the output will be the same as when reading the dat file:
    import re

    input_data = open('average_temperature_kyoto_2018.csv', 'r')

    for rows in input_data:
        # Process only lines that do not start with '#'
        if not re.match('#', rows):
            # Split on commas instead of spaces
            row = rows.rstrip('\n').split(',')
            month = row[0]
            ave_temperature = row[1]
            print(month, ave_temperature)

    input_data.close()
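As an alternative to splitting on commas by hand, Python's standard csv module can parse each line. The following is a sketch of that swapped-in technique, not the article's method. To stay self-contained it reads from an io.StringIO object standing in for the opened file, and the names csv_text, data_lines, and records are assumptions for this illustration.

```python
import csv
import io

# Stand-in for the opened CSV file (first lines of the article's data).
csv_text = (
    "# averaged temperature in 2018 @ Kyoto city\n"
    "# 01: month 02: averaged temperature in the daytime\n"
    "1,3.9\n"
    "2,4.4\n"
    "3,10.9\n"
)
input_data = io.StringIO(csv_text)

# csv.reader accepts any iterable of lines, so comment lines can be
# filtered out before parsing.
data_lines = (line for line in input_data if not line.startswith("#"))

records = []
for row in csv.reader(data_lines):
    # row is a list of strings such as ['1', '3.9']
    records.append((int(row[0]), float(row[1])))

print(records)  # [(1, 3.9), (2, 4.4), (3, 10.9)]
```

The csv module also handles quoted fields that contain commas, which a plain split(',') would break apart.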
[A Little Advanced] Using the ‘with’ Statement for Reading Files Line by Line
When using the with statement (with ~ as ~:), there’s no need to explicitly call close(), because the file is closed automatically when the with block ends. This makes the code a bit more concise. Here’s an example for reading a CSV file:
    import re

    input_file = 'average_temperature_kyoto_2018.csv'

    with open(input_file) as input_data:
        # Read and display one line at a time
        for rows in input_data:
            # Skip commented lines
            if not re.match('#', rows):
                # Separation
                row = rows.rstrip('\n').split(',')
                month = row[0]
                ave_temperature = row[1]

                # Display contents of each line
                print(month, ave_temperature)
with open(input_file) as input_data:
By writing this, the file named by input_file is opened and the resulting file object is assigned to input_data. (You can use any name you prefer for input_data.) Then, you can proceed with reading and processing each line using a for loop, just like before.
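That the file really is closed after the with block can be verified through the closed attribute of the file object. The sketch below writes a small throwaway CSV file first so it runs on its own; the file name sample.csv and the list temperatures are assumptions for this illustration.

```python
import os
import tempfile

# Hypothetical sample file, created only so this sketch runs on its own.
input_file = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(input_file, "w") as output_data:
    output_data.write("# comment line\n1,3.9\n2,4.4\n")

temperatures = []
with open(input_file) as input_data:
    closed_inside = input_data.closed  # False while inside the block
    for rows in input_data:
        if rows.startswith("#"):
            continue
        row = rows.rstrip("\n").split(",")
        temperatures.append(float(row[1]))

# No close() call was needed: the with statement closed the file.
closed_after = input_data.closed

print(closed_inside, closed_after)  # False True
print(temperatures)                 # [3.9, 4.4]
```

The file is closed even if an error occurs inside the block, which is the main practical advantage over calling close() by hand.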
Once you become more comfortable with Python, give the with statement a try!