Python:Loading and Saving Data

From PrattWiki

This page describes several ways to load and save data in Python, using methods from both the NumPy and pandas modules. You first need to know exactly what type of file you want to read from or write to, since some methods only work with certain file types.

If you are loading from and saving to text files, np.loadtxt() and np.savetxt() may be sufficient. They are fairly straightforward and can be told to work with headers, footers, and comments, as well as different delimiters.

On the other hand, if you are loading from Excel documents, or you want to load a data set into a dataframe, you will want to use methods from pandas. Depending on the form of the data file, you may use pd.read_table(), pd.read_csv(), or pd.read_excel(). Dataframes have their own built-in methods for saving to different file types.

Finally, if you are loading from a MATLAB .mat file, you can do that with a method from scipy.io.

For the code examples below, assume that the following has already run:

import numpy as np
import pandas as pd

np.loadtxt()

The NumPy module has a loadtxt() method that reads a rectangular set of (typically numerical) data from a text file and returns a NumPy array. By default, this method reads data separated by spaces or tabs. If the values are separated by something else (e.g. commas), the method needs the delimiter kwarg, such as delimiter=",", to establish what the delimiter is. Note that the method will still ignore whitespace around the numbers, so if there are spaces after the delimiter you do not need to include them in your delimiter option.

There are other useful kwargs for this method. If your data file has one or more header rows that you want to ignore, you can supply the skiprows=N kwarg. If your data file has comments, whether at the top or on subsequent lines, you can supply the comments kwarg with the symbol or string that starts a comment. If there is more than one such indicator, you can give the kwarg a list or tuple of them. Note that the command will ignore the comment indicator as well as anything after it on that line. All of that is to say, if you were to have the following text in a file called spaces_comments.txt

This is at the top
and should be ignored
XXX this is a comment
1     2 3
# as is this
4 5   6
   7   8   9
XXX and another one!
10 11 12 $ what about now?

you could load it into an array called a with:

a = np.loadtxt("spaces_comments.txt", skiprows=2, comments=("#", "$", "XXX"))

and the result would be:

In [N]: print(a)
[[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]
 [10. 11. 12.]]
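
For readers who want to try this without creating the file by hand, the following is a minimal sketch that writes the same text to a temporary folder and then loads it. The temporary folder and the variable names text and path are assumptions made for illustration only; they are not part of the original example.

```python
import os
import tempfile

import numpy as np

# The same contents as spaces_comments.txt above
text = """This is at the top
and should be ignored
XXX this is a comment
1     2 3
# as is this
4 5   6
   7   8   9
XXX and another one!
10 11 12 $ what about now?
"""

# Assumption: write to a temporary folder so nothing existing is overwritten
path = os.path.join(tempfile.mkdtemp(), "spaces_comments.txt")
with open(path, "w") as f:
    f.write(text)

# Skip the two header rows, then treat "#", "$", and "XXX" as comment markers
a = np.loadtxt(path, skiprows=2, comments=("#", "$", "XXX"))

print(a.shape)  # number of (rows, columns) that were read
print(a.dtype)  # loadtxt returns floats unless told otherwise
```

Checking the shape and dtype right after loading is a quick way to confirm that the header rows and comments were skipped as intended.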

There are several other options that can be very useful if you want to specify the format of what you are reading (including storing strings) or to automatically split the data set into separate variables for the data in each column. For the latter, if you have a data file called commas.txt as follows:

1, 2, 3
4, 5, 6
7, 8, 9
10, 11, 12

you could load each column into its own array with:

import numpy as np
x, y, z = np.loadtxt("commas.txt", delimiter=",", unpack=True)

after which you will have:

In []: x
Out[]: array([ 1.,  4.,  7., 10.])

In []: y
Out[]: array([ 2.,  5.,  8., 11.])

In []: z
Out[]: array([ 3.,  6.,  9., 12.])

Note that the results are 1-dimensional arrays - meaning they are neither columns nor rows but...1-dimensional arrays.
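
To make the point about 1-dimensional arrays concrete, here is a small sketch that checks the shape of an unpacked array and shows one way to turn it into an explicit column or row. It uses io.StringIO as a stand-in for commas.txt so that it is self-contained; that substitution and the names col and row are assumptions for illustration.

```python
import io

import numpy as np

# Assumption: an in-memory file with the same contents as commas.txt
data = io.StringIO("1, 2, 3\n4, 5, 6\n7, 8, 9\n10, 11, 12\n")

x, y, z = np.loadtxt(data, delimiter=",", unpack=True)

# A 1-dimensional array has a single entry in its shape
print(x.shape)

# reshape can produce an explicit column or row if one is needed
col = x.reshape(-1, 1)  # 4 rows, 1 column
row = x.reshape(1, -1)  # 1 row, 4 columns
print(col.shape, row.shape)
```

Whether you need the reshaped versions depends on what you plan to do next; many NumPy operations work directly on 1-dimensional arrays.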

np.savetxt()

If you want to save the values in one or more arrays to a text file, you can use NumPy's savetxt() method. The easiest use case is simply to give the command a file name and an array; the command will create a file with the numerical values saved as floating-point numbers with 19 significant figures (!) separated by spaces. Common kwargs include delimiter if you want to change the delimiter and fmt if you want to change the format. The format string is similar to the format specification used with str.format() and f-strings when printing, except that instead of beginning with a colon, it begins with a percent sign (for example, "%i" for integers or "%.3f" for three digits after the decimal point).

As an example, if we want to generate a variable called rolls containing a 3x5 set of random integers between 1 and 6, inclusive, and then save it to a text file called dice.out, the simplest way to do that would be:

rolls = np.random.randint(1, 7, (3,5))
np.savetxt("dice.out", rolls)

This works, but it produces a file containing numbers like:

5.000000000000000000e+00

It is a bit absurd to have 19 significant figures for integers! We can save the array using integers and include commas between the values with:

np.savetxt("nice_dice.out", rolls, delimiter=",", fmt="%i")

The contents of that file will be:

5,4,1,3,1
1,6,4,4,4
6,2,5,6,2
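
As a further sketch, the header kwarg of np.savetxt() can add a descriptive first line, and np.loadtxt() can read the file back. The fixed values in rolls (copied from the file contents above rather than generated randomly), the header text, and the temporary folder are all assumptions chosen so that the example is repeatable.

```python
import os
import tempfile

import numpy as np

# Assumption: fixed values instead of random rolls, so the result is repeatable
rolls = np.array([[5, 4, 1, 3, 1],
                  [1, 6, 4, 4, 4],
                  [6, 2, 5, 6, 2]])

path = os.path.join(tempfile.mkdtemp(), "nice_dice.out")

# The header text is written first; by default savetxt prefixes it with "# "
np.savetxt(path, rolls, delimiter=",", fmt="%i",
           header="five dice, three rounds")

with open(path) as f:
    lines = f.read().splitlines()
print(lines[0])  # the header line
print(lines[1])  # the first row of data

# loadtxt treats "#" as a comment by default, so the header is skipped;
# dtype=int asks for integers instead of floats
back = np.loadtxt(path, delimiter=",", dtype=int)
print(np.array_equal(back, rolls))
```

Because the default comment marker for both functions is "#", a header written by savetxt() does not need any extra kwargs to be ignored by loadtxt().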

Pandas

pandas (the Python Data Analysis Library) has several options for loading and saving data in text files as well as in Excel spreadsheets. See the Pundit page on Pandas for examples.
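
As a brief illustration only, here is a minimal sketch of saving a dataframe to a comma-separated text file and loading it back with pd.read_csv(). The column names, values, and file name are hypothetical and are not taken from any other page.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data set, made up for this sketch
df = pd.DataFrame({"trial": [1, 2, 3], "force": [0.5, 1.25, 2.0]})

path = os.path.join(tempfile.mkdtemp(), "trials.csv")

# index=False leaves the row labels (0, 1, 2) out of the file
df.to_csv(path, index=False)

# read_csv uses the first line of the file as the column names by default
loaded = pd.read_csv(path)
print(loaded.columns.tolist())
print(loaded.shape)
```

Leaving out index=False would write the row labels as an extra unnamed column, which then shows up as an additional column when the file is read back in.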

MATLAB .mat Files

SciPy has a method for loading .mat files, specifically scipy.io.loadmat(). Note that this will load .mat files for .mat versions 4, 6, and 7 through 7.2. Unless the file was specifically saved in MATLAB with the matfile function or with an option to be version 7.3, files will be saved in version 7.[1]

This method will load the data into a Python dictionary, where the keys will be the names of the original variables in MATLAB and the values will be arrays. For example, if you use the following code to create and save three arrays in MATLAB:

clear
a = rand(2, 3)
b = rand(3, 2)
c = randi([1, 6], 3, 5)
save mymat a b c

you will create a version 7 .mat file called mymat.mat. You can then load the information in Python:

In [N]: import scipy.io as sio

In [N]: stuff = sio.loadmat('mymat.mat')

In [N]: stuff
Out[N]: 
{'__header__': b'MATLAB 5.0 MAT-file, Platform: PCWIN64, Created on: Sat Jan 01 02:03:04 2021',
 '__version__': '1.0',
 '__globals__': [],
 'a': array([[0.34038573, 0.22381194, 0.25509512],
        [0.58526775, 0.75126706, 0.50595705]]),
 'b': array([[0.69907672, 0.54721553],
        [0.89090325, 0.13862444],
        [0.95929143, 0.14929401]]),
 'c': array([[2, 5, 3, 4, 5],
        [6, 2, 2, 3, 4],
        [2, 6, 2, 3, 4]], dtype=uint8)}

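The default case for data type is float. For the c variable, whose values are all small whole numbers, the method returned an array of unsigned 8-bit integers (uint8), which is the type the values were stored as in the file; the mat_dtype=True kwarg asks loadmat to return the type MATLAB itself would use when loading the file instead. To access the information, you index the variable holding the information with the name of the variable you want: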

In [N]: stuff["a"]
Out[N]: 
array([[0.34038573, 0.22381194, 0.25509512],
       [0.58526775, 0.75126706, 0.50595705]])

In [N]: stuff["c"] @ stuff["c"].T
Out[N]: 
array([[79, 60, 72],
       [60, 69, 53],
       [72, 53, 69]], dtype=uint8)

where the second command matrix-multiplies the c array by its own transpose.
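
One thing to watch for with integer arrays like this: the result of the multiplication is also uint8, which can only hold values from 0 through 255. The sketch below recreates the c array directly in NumPy (an assumption, so that no .mat file is needed) and shows what happens when the values get larger; the factor of 10 is a hypothetical choice made only to force larger results.

```python
import numpy as np

# Assumption: the same values as the c array above, typed in directly
c = np.array([[2, 5, 3, 4, 5],
              [6, 2, 2, 3, 4],
              [2, 6, 2, 3, 4]], dtype=np.uint8)

# Every entry of this product is below 256, so the uint8 result is exact
small = c @ c.T
print(small.dtype)

# Hypothetical larger values: each entry is still at most 60, which fits
scaled = c * np.uint8(10)

# The true top-left entry of this product is 7900, which does not fit in
# uint8, so the stored value silently wraps around modulo 256
wrapped = scaled @ scaled.T

# Converting to a wider integer type first avoids the wrap-around
safe = scaled.astype(int) @ scaled.astype(int).T
print(wrapped[0, 0], safe[0, 0])
```

If the loaded values will be used in further calculations, converting them with .astype(int) or .astype(float) right after loading is one simple way to avoid this.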

If you want Python to act more like MATLAB in terms of loading things into their own names, you can use a loop and the exec command to get Python to make variables with the original names:

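import numpy as np
import scipy.io as sio

# Load every variable in the .mat file into a dictionary
stuff = sio.loadmat('mymat.mat')

for key in stuff:
    # Skip metadata entries such as __header__, __version__, and __globals__
    if key.startswith("_"):
        continue
    # Build and run a command such as:  a = stuff["a"]
    cmd = '{0} = stuff["{0}"]'.format(key)
    exec(cmd)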
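
If you would rather not create variables with exec, one alternative (a sketch, not the approach described above) is to keep the data in a dictionary that leaves out the metadata entries. Since MATLAB may not be available, this example uses scipy.io.savemat() to create a stand-in file; the array contents and the temporary folder are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio

# Assumption: savemat stands in for MATLAB's save command in this sketch
path = os.path.join(tempfile.mkdtemp(), "mymat.mat")
sio.savemat(path, {"a": np.ones((2, 3)), "b": np.zeros((3, 2))})

stuff = sio.loadmat(path)

# Keep only the saved variables, dropping the entries that start with "_"
data = {key: value for key, value in stuff.items()
        if not key.startswith("_")}

print(sorted(data))     # names of the saved variables
print(data["a"].shape)  # each value is a NumPy array
```

Keeping the arrays in a dictionary means the names coming from the file cannot accidentally replace variables that already exist in your script.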

References

  1. MAT-File Versions, MathWorks web site, accessed 1/2021