I am trying to use NumPy to vectorize the parsing of a text file containing lines of numbers into a numpy array. The data in the text file looks like this:
*** .txt file ***
1 0 0 0 0
2 1 0 0 0
3 1 1 0 0
4 0 1 0 0
5 0 0 1 0
6 1 0 1 0
7 1 1 1 0
8 0 1 1 0
9 0.5 0.5 0 0
10 0.5 0.5 1 0
11 0.5 0 0.5 0
12 1 0.5 0.5 0
13 0.5 1 0.5 0
14 0 0.5 0.5 0
*** /.txt file ***
My approach is to read the lines in using file.readlines(), then convert that list of line strings into a numpy array as follows (the file.readlines() part is omitted for testing):
import numpy as np

short_list = ['1 0 0 0 0\n',
              '2 1 0 0 0\n',
              '3 1 1 0 0\n']

long_list = ['1 0 0 0 0\n',
             '2 1 0 0 0\n',
             '3 1 1 0 0\n',
             '4 0 1 0 0\n',
             '5 0 0 1 0\n',
             '6 1 0 1 0\n',
             '7 1 1 1 0\n',
             '8 0 1 1 0\n',
             '9 0.5 0.5 0 0\n',
             '10 0.5 0.5 1 0\n',
             '11 0.5 0 0.5 0\n',
             '12 1 0.5 0.5 0\n',
             '13 0.5 1 0.5 0\n',
             '14 0 0.5 0.5 0\n']
def lines_to_npy(lines):
    n_lines = len(lines)
    # Stack the line strings into a fixed-width bytes array, then decode
    # the raw buffer back into a single string and split it on whitespace.
    lines_array = np.array(lines).astype('S')
    tmp = lines_array.tobytes().decode('ascii')
    print(repr(tmp))
    print(lines_array.dtype)
    print(np.array(tmp.split(), dtype=np.int32).reshape(n_lines, -1))
lines_to_npy(short_list)
lines_to_npy(long_list)
Calling the function with short_list produces the following output:
'1 0 0 0 0\n2 1 0 0 0\n3 1 1 0 0\n'
|S10
[[1 0 0 0 0]
[2 1 0 0 0]
[3 1 1 0 0]]
This is the desired result (from reading around, I gather that "|S10" means each element of the array is a 10-character string for which endianness doesn't matter). However, calling it with the long list inserts several null characters (\x00) at the end of each string, which makes the output harder to parse:
'1 0 0 0 0\n\x00\x00\x00\x00\x002 1 0 0 0\n\x00\x00\x00\x00\x003 1 1 0 0\n\x00\x00\x00\x00\x004 0 1 0 0\n\x00\x00\x00\x00\x005 0 0 1 0\n\x00\x00\x00\x00\x006 1 0 1 0\n\x00\x00\x00\x00\x007 1 1 1 0\n\x00\x00\x00\x00\x008 0 1 1 0\n\x00\x00\x00\x00\x009 0.5 0.5 0 0\n\x0010 0.5 0.5 1 0\n11 0.5 0 0.5 0\n12 1 0.5 0.5 0\n13 0.5 1 0.5 0\n14 0 0.5 0.5 0\n'
|S15
Note that an error was raised in my function when loading the null characters into an array, preventing a final result. I know that a "cheap and dirty" solution would be to just strip the null characters off the end, as sketched below. I also know that I could use Pandas to accomplish the main goal, but I'd like to understand why this behavior is occurring.
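For reference, here is a minimal sketch of that "cheap and dirty" workaround (the function name is hypothetical, and float64 is used instead of int32 because the long list contains 0.5 values):

def lines_to_npy_stripped(lines):
    n_lines = len(lines)
    lines_array = np.array(lines).astype('S')
    # Drop the \x00 padding before splitting.
    tmp = lines_array.tobytes().decode('ascii').replace('\x00', '')
    return np.array(tmp.split(), dtype=np.float64).reshape(n_lines, -1)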
The \x00 are padded onto the end of each string to make every string 15 characters long. This kind of makes sense: the dtype of the short array was |S10, and each of its strings happened to be exactly 10 characters long. The long array contains 14 strings, its dtype was |S15, and extra \x00 are appended to bring each item up to 15 characters. I am confused because the number of elements in the list of strings (3 vs. 14) has no correlation to the length of each string, so I don't understand why the dtype changes to |S15 when more list elements are added.
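For what it's worth, the padding can be reproduced with just two strings of different lengths, independent of how many elements the list has (the array name a is just for illustration):

a = np.array(['1 0 0 0 0\n', '10 0.5 0.5 1 0\n']).astype('S')
print(a.dtype)            # |S15: wide enough to hold the longest string
print(repr(a.tobytes()))  # the shorter element is padded with \x00 to 15 bytes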
Update: I did some more research on ways to efficiently read data from a text file into a numpy array. I need a fast method because I am reading files with ~10M lines. numpy.loadtxt() and numpy.genfromtxt() are candidate solutions, but they are very slow because they are implemented in Python and essentially do the same thing as manually looping through file.readlines(), stripping, and splitting the line strings (source). In my own testing, numpy.loadtxt() was about twice as slow as the aforementioned manual method, which was also noted here.
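For reference, the manual readlines-style baseline I am comparing against looks roughly like this ("data.txt" is a placeholder path):

with open("data.txt") as f:
    # Loop, split, convert: essentially the same work that loadtxt and
    # genfromtxt perform internally in Python.
    rows = [line.split() for line in f]
arr = np.array(rows, dtype=np.float64)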
I found that using pandas.read_csv().to_numpy(), I was able to get a speedup of ~10x over looping through file.readlines(). See this answer here. Hopefully this helps anyone in the future with the same application.
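A sketch of the pandas approach, assuming a whitespace-delimited file like the sample above ("data.txt" is again a placeholder path):

import pandas as pd

# read_csv parses the file in C, which is where the speedup comes from;
# to_numpy() then hands back a regular ndarray.
arr = pd.read_csv("data.txt", sep=r"\s+", header=None).to_numpy()
print(arr.shape, arr.dtype)  # (14, 5) float64 for the sample file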