How to Read a Data File in Bulk Using Python

Are you familiar with reading data files (.dat, .csv, etc.) line by line, but unsure how to read an entire column of data at once? This article explains in detail how to read all the data in a specific column of a file using NumPy's np.loadtxt().

Tested Environment
macOS Catalina (10.15.7), Python 3.7.6, Atom Editor 1.44.0

Reading Bulk Data from a dat File

Preparing the dat File

We have prepared a data file named “average_temperature_kyoto_2018.dat” containing average monthly temperatures in Kyoto city for the year 2018.

# averaged temperature in 2018 @ Kyoto city
# 01: month  02: averaged temperature in the daytime
1 3.9
2 4.4
3 10.9
4 16.4
5 20.0
6 23.4
7 29.8
8 29.5
9 23.6
10 18.7
11 13.5
12 8.2

Save this file in the directory “Desktop/Dr.code/python/data-analysis/input_file_all”.

Writing the Code

Create a file named “input_file.py” in the same directory as “average_temperature_kyoto_2018.dat”. Let’s get straight to the point: the code for reading the file is shown below. It is shorter than code that reads the file line by line.

import numpy as np

data_file = 'average_temperature_kyoto_2018.dat'
# Read column 0 (month) and column 1 (temperature); lines starting with '#' are ignored
month = np.loadtxt(data_file, comments='#', usecols=0)
ave_temperature = np.loadtxt(data_file, comments='#', usecols=1)

print(month)
print(ave_temperature)

Running the Program

Let’s execute the program mentioned above. Open your terminal, navigate to “Desktop/Dr.code/python/data-analysis/input_file_all”, and run the following command:

python input_file.py

# (Output)
# [ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12.]
# [ 3.9  4.4 10.9 16.4 20.  23.4 29.8 29.5 23.6 18.7 13.5  8.2]

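If all goes well, you should see the output shown above. The months are printed as 1., 2., and so on because np.loadtxt() returns floating-point values by default. Since the code was presented without explanation, let’s now look at how np.loadtxt() works.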

Explanation of the Code

numpy.loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes', max_rows=None)

This is the signature of numpy.loadtxt(), which has 11 parameters in the version tested here. Only fname is required; the others are optional, so you pass only the ones you need. The function reads the specified columns (usecols) from a data file (fname) and returns an array containing the loaded data. For example, col1 = numpy.loadtxt(fname, usecols=0) stores the first column of the file fname in the array col1. Because the code above uses import numpy as np, the function can be called with the shorthand np.loadtxt().
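When usecols is omitted, np.loadtxt() returns the whole file as one two-dimensional array, and individual columns can be taken from it by slicing. The following is a minimal sketch of that behaviour; it uses io.StringIO as an in-memory stand-in for the first rows of the data file (an assumption made only to keep the example self-contained).

```python
import io
import numpy as np

# In-memory stand-in for the first rows of the data file (illustration only).
sample = io.StringIO(
    "# month  temperature\n"
    "1 3.9\n"
    "2 4.4\n"
    "3 10.9\n"
)

# Without usecols, every column is loaded into a single 2-D array.
table = np.loadtxt(sample, comments='#')

print(table.shape)              # (rows, columns)
month = table[:, 0]             # first column
ave_temperature = table[:, 1]   # second column
print(month)
print(ave_temperature)
```

Reading the file once and slicing avoids parsing the same file twice, which may matter for large files.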

Here are the details of each parameter of numpy.loadtxt() (summarized for your reference):

Parameter name | Type | Summary
fname | String | Specify the name or path of the file you want to read.
dtype | Data type | (Optional) The data type of the returned array. The default is float.
comments | String | (Optional) The character indicating the start of a comment. The default is ‘#’, as shown above.
delimiter | String | (Optional) The string used to separate values. The default is any whitespace.
converters | dict | (Optional) Maps a column number to a function that converts the text in that column to a value. It can also be used to supply a value for missing data. The default is None.
skiprows | Int | (Optional) The number of initial lines to skip, including comment lines. The default is 0.
usecols | Int or sequence of ints | (Optional) Which columns to load, counting from 0. The default is to load all columns. For example, usecols=(1,4,5) will load the 2nd, 5th, and 6th columns.
unpack | Boolean | (Optional) Default is False. If set to True, the returned array is transposed, so each column can be stored in a separate variable.
ndmin | Int | (Optional) The minimum number of dimensions for the returned array. Default is 0, and the allowed values are 0, 1, and 2.
encoding | String | (Optional) The encoding used for decoding the input file. Default is ‘bytes’ in the signature shown above.
max_rows | Int | (Optional) How many rows of content to read after the skiprows lines. By default, it reads all lines. This can be useful to avoid reading the last few lines.
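To make a few of these parameters concrete, here is a small sketch combining delimiter, skiprows, max_rows, dtype, and converters. The sample text, its uncommented header line, and the Celsius-to-Fahrenheit conversion are hypothetical and chosen only for illustration.

```python
import io
import numpy as np

# Hypothetical sample whose header line is NOT marked as a comment.
text = (
    "month,temperature\n"
    "1,3.9\n"
    "2,4.4\n"
    "3,10.9\n"
    "4,16.4\n"
)

# skiprows=1 skips the header; max_rows=2 then reads only two data rows.
first_two = np.loadtxt(io.StringIO(text), delimiter=',', skiprows=1, max_rows=2)

# dtype=int with usecols=0 returns the month numbers as integers.
months = np.loadtxt(io.StringIO(text), delimiter=',', skiprows=1,
                    usecols=0, dtype=int)

# converters maps a column number to a function applied to each raw field.
# Here column 1 is converted from Celsius to Fahrenheit (illustration only).
to_fahrenheit = {1: lambda field: float(field) * 9 / 5 + 32}
converted = np.loadtxt(io.StringIO(text), delimiter=',', skiprows=1,
                       converters=to_fahrenheit)

print(first_two)
print(months)
print(converted[:, 1])
```

Each call wraps the text in a fresh io.StringIO because a file-like object is consumed once it has been read.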

Taking It a Step Further

By using the usecols and unpack parameters, you can read both columns with a single call to np.loadtxt(), which makes the code one line shorter than before.

import numpy as np

data_file = 'average_temperature_kyoto_2018.dat'
# unpack=True returns one array per column, so both can be assigned at once
month, ave_temperature = np.loadtxt(data_file, usecols=(0, 1), unpack=True)

print(month)
print(ave_temperature)
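To see what unpack=True does, the sketch below compares it with the default behaviour: the unpacked arrays match the columns of the two-dimensional array returned without unpack. The three-row sample is an in-memory stand-in for the data file, used here only so the example runs on its own.

```python
import io
import numpy as np

# In-memory stand-in for three rows of the data file (illustration only).
text = "1 3.9\n2 4.4\n3 10.9\n"

# unpack=False (default): one 2-D array of shape (rows, columns).
table = np.loadtxt(io.StringIO(text))

# unpack=True: the array is transposed, so each column gets its own variable.
month, ave_temperature = np.loadtxt(io.StringIO(text), unpack=True)

same_month = np.array_equal(month, table[:, 0])
same_temperature = np.array_equal(ave_temperature, table.T[1])
print(same_month, same_temperature)
```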

Reading Bulk Data from a CSV File

CSV File

We’ve prepared a file named “average_temperature_kyoto_2018.csv” for you. Just like before, place it in the “Desktop/Dr.code/python/data-analysis/input_file_all” directory. The contents are as follows:

# averaged temperature in 2018 @ Kyoto city
# 01: month  02: averaged temperature in the daytime
1,3.9
2,4.4
3,10.9
4,16.4
5,20.0
6,23.4
7,29.8
8,29.5
9,23.6
10,18.7
11,13.5
12,8.2

Sample Code

All it takes is adding delimiter=',' to the call: change np.loadtxt(data_file, comments='#', usecols=0) to np.loadtxt(data_file, comments='#', delimiter=',', usecols=0). Here’s the complete code; the output is the same as for the dat file.

import numpy as np

data_file = 'average_temperature_kyoto_2018.csv'
month = np.loadtxt(data_file, comments='#', delimiter=',', usecols=0)
ave_temperature = np.loadtxt(data_file, comments='#', delimiter=',', usecols=1)

print(month)
print(ave_temperature)
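The single-call form with unpack=True should work for comma-separated data as well. The sketch below assumes the same layout as the CSV file above and reads a shortened copy of it from memory instead of from disk.

```python
import io
import numpy as np

# Shortened in-memory copy of the CSV contents (illustration only).
csv_text = (
    "# averaged temperature in 2018 @ Kyoto city\n"
    "# 01: month  02: averaged temperature in the daytime\n"
    "1,3.9\n"
    "2,4.4\n"
    "3,10.9\n"
)

# delimiter=',' splits on commas; unpack=True returns one array per column.
month, ave_temperature = np.loadtxt(
    io.StringIO(csv_text), comments='#', delimiter=',',
    usecols=(0, 1), unpack=True,
)

print(month)
print(ave_temperature)
```

To read the real file, pass the file name 'average_temperature_kyoto_2018.csv' in place of the io.StringIO object.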