Fast data reading from text file in numpy

Question

How can I speed up reading data and converting types with numpy? In addition, I get numpy.void objects instead of ndarrays, which as far as I know is because the array is heterogeneous. I wrote a simple test that shows numpy.genfromtxt is slower than pure Python code, but I am sure there must be a better way. I couldn't get numpy.loadtxt to work.

How can I improve the performance? And how do I get ndarray sub-arrays as the result?

import timeit
import numpy as np

line = "QUAD4   1       123456  123456781.2345671.2345671.234567        "
text = [line + "\n" for x in range(1000000)]
with open("testQUADs","w") as f:
    f.writelines(text)


setup="""
import numpy as np
"""

st="""
with open("testQUADs", "r") as f:
    fn = f.readlines()
for i, line in enumerate(fn):
    l = [line[0:8], line[8:16], line[16:24], line[24:32], line[32:40], line[40:48], line[48:56], line[56:64], line[64:72], line[72:80]]
    fn[i] = [l[0].strip(), int(l[1]), int(l[2]), int(l[3]), float(l[4]), float(l[5]), float(l[6]), l[7].strip()]
fn = np.array(fn)
"""

stnp="""
array = np.genfromtxt("testQUADs", delimiter=8, dtype="|S8, i4, i4, i4, f8, f8, f8, |S8")
print(array[0])
print(type(array[0]))
"""


print(timeit.timeit(st, setup=setup, number=1))
print(timeit.timeit(stnp, setup=setup, number=1))

Output:

4.560215269000764
(b'QUAD4   ', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, b'        ')
<class 'numpy.void'>
6.360823633000109

Well after 6 views I got a negative evaluation, and honestly I do not understand why. I can not believe it. I have read the numpy user manual on this topic, I have had a look to the numpy reference manual on this topic, I have had a look on the web and my question is not solved. — Mantxu, Mar 10 '16 at 19:54
don't mind i will "fix" it, because this question might be interesting also to others;) — MaxU - stand with Ukraine, Mar 10 '16 at 19:56
unfortunately i have a very little numpy experience, so forgive me my ignorance, but would it be possible to read your data using `numpy.fromfile()` - documentation says - "A highly efficient way of reading binary data with a known data-type"... — MaxU - stand with Ukraine, Mar 10 '16 at 20:05
Possible duplicate: http://stackoverflow.com/questions/15096269/the-fastest-way-to-read-input-in-python — Warren Weckesser, Mar 10 '16 at 20:06

hpaulj · Answer 1 · 2016-03-10T22:43:53.807

What you get from

array = np.genfromtxt("testQUADs", delimiter=8, dtype="|S8, i4, i4, i4, f8, f8, f8, |S8")

is a structured array.

array.dtype

will look like

np.dtype("|S8, i4, i4, i4, f8, f8, f8, |S8")

array.shape is (n,), where n is the number of rows; it's a 1d array with 8 fields.

array[0] is one element or record of this array; look at its dtype. Don't worry about its type (void is just the type of a compound dtype record).

array['f0'] is the first field, all rows, in this case an array of strings.

You may need to read the dtype and structured array docs in more depth. Many SO posters have been confused about the 1d structured array that genfromtxt produces.
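A minimal sketch of the points above, building a small structured array by hand instead of reading the file. The record values are copied from the question's test line, and the dtype string is the one from the question written without spaces; the variable names are only for illustration.

```python
import numpy as np

# Same field layout as the dtype passed to genfromtxt in the question.
dt = np.dtype("S8,i4,i4,i4,f8,f8,f8,S8")
rec = (b"QUAD4   ", 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, b"        ")
arr = np.array([rec] * 3, dtype=dt)

print(arr.shape)         # 1d: one element per row
print(arr.dtype.names)   # auto-generated field names f0 ... f7
print(type(arr[0]))      # a single record is a numpy.void scalar
print(type(arr["f1"]))   # a single field is an ordinary ndarray

first_ints = arr["f1"]   # all rows of the first integer column
```

So the numpy.void type only shows up when you index a single row; indexing by field name gives you a regular ndarray with a simple dtype.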

genfromtxt reads the file just like your code does, and splits each line into strings. Then it converts those strings according to the dtype, and collects the results in a list. At the end it assembles that list into an array, the 1d array of the specified dtype. Since it is doing more than your code, it's not surprising that it is a bit slower.

loadtxt does much the same, with less power in certain areas.

pandas has a csv reader that is faster because it uses more compiled code. But a dataframe isn't any easier to understand than a structured array.

Your 2 methods don't produce the same thing:

In [105]: line = "QUAD4   1       123456  123456781.2345671.2345671.234567        "

In [106]: txt=[line,line,line]    # a list of lines instead of a file

In [107]: A = np.genfromtxt(txt, delimiter=8, dtype="|S8, i4, i4, i4, f8, f8, f8, |S8")

In [108]: A
Out[108]: 
array([ ('QUAD4   ', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, '        '),
       ('QUAD4   ', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, '        '),
       ('QUAD4   ', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, '        ')], 
      dtype=[('f0', 'S8'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', 'S8')])

Note the dtype, and that there are 3 elements.

Your line parser:

In [109]: fn=txt[:]    
In [110]: for i, line in enumerate(fn):
   .....:     l = [line[0:8], line[8:16], line[16:24], line[24:32], line[32:40], line[40:48], line[48:56], line[56:64], line[64:72], line[72:80]]
   .....:     fn[i] = [l[0].strip(), int(l[1]), int(l[2]), int(l[3]), float(l[4]), float(l[5]), float(l[6]), l[7].strip()]
   .....:

In [111]: fn
Out[111]: 
[['QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ''],
 ['QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ''],
 ['QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, '']]

In [112]: A1=np.array(fn)

In [113]: A1
Out[113]: 
array([['QUAD4', '1', '123456', '12345678', '1.234567', '1.234567',
        '1.234567', ''],
       ['QUAD4', '1', '123456', '12345678', '1.234567', '1.234567',
        '1.234567', ''],
       ['QUAD4', '1', '123456', '12345678', '1.234567', '1.234567',
        '1.234567', '']], 
      dtype='|S8')

fn is a list of lists, which can hold values of different types. But when you put it into an array, numpy turns everything into strings.
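A small sketch of that conversion, using one parsed row from the question. The session above is Python 2, where the result is a bytes ('S') array; on Python 3 I'd expect a unicode ('U') array instead. Passing dtype=object is shown only as a contrast; it keeps the Python values but gives up fast numeric operations.

```python
import numpy as np

# One row as produced by the question's line parser: mixed str, int and float.
row = ["QUAD4", 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ""]

# With no dtype given, numpy picks one common type for every element: a string.
as_strings = np.array([row, row])
print(as_strings.dtype.kind)   # 'U' on Python 3

# dtype=object keeps the original Python objects, one pointer per element.
as_objects = np.array([row, row], dtype=object)
print(type(as_objects[0, 1]))
```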

I could turn your fn list into a structured array with:

In [120]: np.array([tuple(l) for l in fn],dtype=A.dtype)
Out[120]: 
array([('QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ''),
       ('QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ''),
       ('QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, '')], 
      dtype=[('f0', 'S8'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', 'S8')])

That's the same as A from genfromtxt except for the padding of the strings.

Here's a variation that might be useful, though it might also stretch your knowledge of structured arrays:

In [132]: dt=np.dtype('S8,(3)i,(3)f,S8')
In [133]: A = np.genfromtxt(txt, delimiter=8, dtype=dt)

A now has 4 fields, two of which hold multiple values.

A['f1'] will return an (n,3) array of ints.
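Here is a hedged sketch of two ways to end up with plain (n,3) sub-arrays. It builds the arrays from tuples rather than from the file. The first option uses numpy.lib.recfunctions.structured_to_unstructured (available in recent numpy versions) on the flat 8-field layout; the second spells out a nested dtype as a list of tuples, using i4 and f8 for the sub-arrays, which is my assumption and not exactly the 'i' and 'f' codes used above.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Flat layout, matching the dtype used with genfromtxt in the question.
flat_dt = np.dtype("S8,i4,i4,i4,f8,f8,f8,S8")
rec = (b"QUAD4", 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, b"")
flat = np.array([rec] * 3, dtype=flat_dt)

# Option 1: gather same-typed fields into one ordinary 2d ndarray.
ints = rfn.structured_to_unstructured(flat[["f1", "f2", "f3"]])
floats = rfn.structured_to_unstructured(flat[["f4", "f5", "f6"]])

# Option 2: a nested dtype where two fields are themselves length-3 sub-arrays.
nested_dt = np.dtype([("f0", "S8"), ("f1", "i4", (3,)), ("f2", "f8", (3,)), ("f3", "S8")])
nested_rec = (b"QUAD4", (1, 123456, 12345678), (1.234567, 1.234567, 1.234567), b"")
nested = np.array([nested_rec] * 3, dtype=nested_dt)

print(ints.shape, floats.shape)   # both (n, 3)
print(nested["f1"].shape)         # also (n, 3)
```

Either way the numeric columns come out as regular ndarrays that you can slice and do arithmetic on.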

score 0 · Answer 2 · answered Mar 10 '16 at 23:22

There is also:

np.loadtxt

You can use it if you're sure that each row has the same number of values. But the previous answer already covers everything ;)
