7

I ran into the following problem with NumPy 1.10.2 when reading a CSV file. I cannot figure out how to give explicit datatypes to genfromtxt.

Here is the CSV, minimal.csv:

x,y
1,hello
2,hello
3,jello
4,jelly
5,belly

Here I try to read it with genfromtxt:

import numpy
numpy.genfromtxt('minimal.csv', dtype=(int, str))

I also tried:

import numpy
numpy.genfromtxt('minimal.csv', names=True, dtype=(int, str))

Anyway, I get the error:

Traceback (most recent call last):
  File "visualize_numpy.py", line 39, in <module>
    numpy.genfromtxt('minimal.csv', dtype=(int, str))
  File "/Users/xeli/workspace/myproj/env/lib/python3.5/site-packages/numpy/lib/npyio.py", line 1518, in genfromtxt
    replace_space=replace_space)
  File "/Users/xeli/workspace/myproj/env/lib/python3.5/site-packages/numpy/lib/_iotools.py", line 881, in easy_dtype
    ndtype = np.dtype(ndtype)
ValueError: mismatch in size of old and new data-descriptor

Alternatively, I tried:

import numpy
numpy.genfromtxt('minimal.csv', dtype=[('x', int), ('y', str)])

Which throws:

Traceback (most recent call last):
  File "visualize_numpy.py", line 39, in <module>
    numpy.genfromtxt('minimal.csv', dtype=[('x', int), ('y', str)])
  File "/Users/xeli/workspace/myproj/env/lib/python3.5/site-packages/numpy/lib/npyio.py", line 1834, in genfromtxt
    rows = np.array(data, dtype=[('', _) for _ in dtype_flat])
ValueError: size of tuple must match number of fields.

I know that dtype=None makes NumPy try to guess the correct types, and it usually works well. However, the documentation mentions that it is much slower than giving explicit types. In my case computational efficiency is required, so dtype=None is not an option.

Is there something terribly wrong with my approach or NumPy?

Akseli Palén
    I had a very similar problem, which I solved by giving the dtype as a list instead of a tuple, and it seems the same is true for your case. – pela Sep 09 '17 at 08:16

3 Answers

3

This works well, and preserves your header information:

df = numpy.genfromtxt('minimal.csv',
                      names=True,
                      dtype=None,
                      delimiter=',')

This lets genfromtxt guess the dtype, which is generally what you want. The delimiter is a comma, so pass delimiter=',' as well. Finally, names=True preserves the header information.

Simply access a column by name, as you would with a data frame:

>>> print(df['x'])
[1 2 3 4 5]

Edit: as per your comment below, you could provide the dtype explicitly, like so:

df = numpy.genfromtxt('file1.csv',
                      names=True,
                      dtype=[('x', int), ('y', 'S5')], # assuming each string has length <= 5
                      delimiter=',')
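
For anyone who wants to try this without the file on disk, here is a minimal self-contained sketch of the same idea. It assumes NumPy 1.14 or newer (for the encoding argument), feeds the CSV from an in-memory io.StringIO instead of minimal.csv, and uses 'U5' rather than 'S5' so that Python 3 yields str values instead of bytes. Those three choices are illustrative substitutions, not part of the call above.

```python
import io

import numpy as np

# In-memory stand-in for minimal.csv from the question.
CSV_TEXT = "x,y\n1,hello\n2,hello\n3,jello\n4,jelly\n5,belly\n"

# Explicit types: an integer column and a unicode string column of width 5.
# A bare `str` carries no width, so a fixed width is spelled out here.
data = np.genfromtxt(io.StringIO(CSV_TEXT),
                     names=True,        # first row holds the field names
                     delimiter=',',     # the default splits on whitespace
                     dtype=[('x', int), ('y', 'U5')],
                     encoding='utf-8')

print(data['x'])
print(data['y'])
```

Strings longer than the declared width are cut off, so the width has to cover the longest value in the column.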
Nelewout
  • Thanks! And sorry, unfortunately `dtype=None` is not suitable in my case due to its slowness. I added this to the question. I just cannot figure out how to give the types to genfromtxt explicitly. – Akseli Palén Dec 16 '15 at 17:48
  • @AkseliPalén, see my updated answer! I hope this helps :) – Nelewout Dec 16 '15 at 17:55
0

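From a brief glance at the documentation, the default is delimiter=None, which splits fields on whitespace. Your file has no whitespace within a row, so each line is read as a single field rather than two.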

Try numpy.genfromtxt('minimal.csv', dtype=(int, str), names=True, delimiter=',')
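
To see the effect of the delimiter in isolation, here is a small sketch that reads every field as a plain string. It assumes NumPy 1.14 or newer (for the encoding argument) and uses an in-memory io.StringIO as a stand-in for the file; both are illustrative choices.

```python
import io

import numpy as np

# In-memory stand-in for the first rows of minimal.csv.
CSV_TEXT = "x,y\n1,hello\n2,hello\n"

# Default delimiter (None): fields are split on whitespace, so each
# line of the file ends up as one single field.
no_delim = np.genfromtxt(io.StringIO(CSV_TEXT), dtype=str, encoding='utf-8')

# Explicit comma delimiter: each line is split into two fields.
with_delim = np.genfromtxt(io.StringIO(CSV_TEXT), dtype=str,
                           delimiter=',', encoding='utf-8')

print(no_delim.shape, with_delim.shape)
```

With only one field per row, a two-field dtype cannot match, which is consistent with the "size of tuple must match number of fields" error in the question.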

pushkin
0

I am in the same position where I am not sure why my provided types are throwing an error. That said, this might be a workable solution for you. Here's an example using my data set, which seems similar to yours.

First, load some of the data and inspect the actual dtypes NumPy uses:

>>> import numpy as np
>>> movies = np.genfromtxt('movies.csv', delimiter='|', dtype=None)
>>> movies
array([(1, 'Toy Story (1995)'), (2, 'GoldenEye (1995)'),
       (3, 'Four Rooms (1995)'), ..., (1680, 'Sliding Doors (1998)'),
       (1681, 'You So Crazy (1994)'),
       (1682, 'Scream of Stone (Schrei aus Stein) (1991)')],
      dtype=[('f0', '<i8'), ('f1', 'S81')])

Then load all your data using the detected types:

>>> movies = np.genfromtxt('movies.csv', delimiter='|',
...                        dtype=[('f0', '<i8'), ('f1', 'S81')])

This is admittedly not as satisfactory as knowing why NumPy is throwing the error, but it works for your specific use case.
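
The two steps can also be chained in code, so the detected dtype is not copied by hand. The following is only a sketch under a few assumptions: NumPy 1.14 or newer (for the encoding argument), a tiny in-memory stand-in for movies.csv, a sample of two rows via max_rows, and an arbitrary width of 64 characters for string fields.

```python
import io

import numpy as np

# Hypothetical in-memory stand-in for movies.csv.
MOVIES = "1|Toy Story (1995)\n2|GoldenEye (1995)\n3|Four Rooms (1995)\n"

# Step 1: let genfromtxt guess the types from a small sample only.
sample = np.genfromtxt(io.StringIO(MOVIES), delimiter='|', dtype=None,
                       encoding='utf-8', max_rows=2)

# Step 2: reuse the guessed field types, but widen string fields, because
# the width guessed from the sample may be too narrow for later rows.
explicit = [(name, 'U64' if sample.dtype[name].kind in 'SU' else sample.dtype[name])
            for name in sample.dtype.names]

# Step 3: load the whole input with the explicit dtype.
movies = np.genfromtxt(io.StringIO(MOVIES), delimiter='|', dtype=explicit,
                       encoding='utf-8')

print(movies.dtype)
```

Guessing on a sample keeps the slow type detection short, but the result is only as good as the sample: a column that looks like integers in the first rows and contains text later would still fail on the full load.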

jds