Numpy IndexError reading csv with genfromtxt and first column string

Question

I have some code that reads a file of tab-separated values (TSV). It works fine when the first column is a number, but fails when it is a string.

import os
import numpy as np

input_file = os.path.normpath('C:/Users/sturaroa/Documents/PycharmProjects/my_file.tsv')

# read values from file, by column
my_data = np.genfromtxt(input_file, delimiter='\t', skip_header=0)
print('my_data\n' + str(my_data))

groups = my_data[:, 0]  # 1st column
X = my_data[:, 1]  # 2nd column
Y = my_data[:, 2]  # 3rd column
errors = my_data[:, 3]  # 4th column (errors)
print('\ngroups ' + str(groups) + '\nX ' + str(X) + '\nY ' + str(Y) + '\nerrors ' + str(errors))

This is the file content (tab separated)

2.4    2    4.0    0.0
2.4    4    8.210526    0.7254761
2.9    4    8.4    0.8081221
2.9    6    12.52    1.0544369

The program prints this

my_data
[[  2.4         2.          4.          0.       ]
 [  2.4         4.          8.210526    0.7254761]
 [  2.9         4.          8.4         0.8081221]
 [  2.9         6.         12.52        1.0544369]]

groups [ 2.4  2.4  2.9  2.9]
X [ 2.  4.  4.  6.]
Y [  4.         8.210526   8.4       12.52    ]
errors [ 0.         0.7254761  0.8081221  1.0544369]

I've seen this question, which suggests using dtype=None. However, if I do that, I get this error:

Traceback (most recent call last):
  File "C:/Users/sturaroa/Documents/PycharmProjects/2d_plot_test.py", line 11, in <module>
    groups = my_data[:, 0]  # 1st column
IndexError: too many indices for array

I need to adjust my code to work with an input like this

something    2    4.0    0.0
something    4    8.210526    0.7254761
some_other_thing    8.4    0.8081221
some_other_thing    12.52    1.0544369

This first column is a string of variable length, the other columns are numbers (int or float).

I'm using numpy 1.9.2 on Python 2.7.

score 2 · Accepted Answer · answered May 08 '15 at 18:58

When you read with dtype=None and there are string columns, genfromtxt gives you a structured array. Print my_data, and look at its shape and dtype (and add those to your question).

You access columns of such an array by name, not index. Since you don't use the header or give names, the first column will be accessed with my_data['f0'].

You may need to review the numpy docs on structured arrays.
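As a minimal sketch of accessing fields by name: the sample below is a hypothetical four-column version of the input (the last two rows in the question are missing a value, so X values are filled in here for illustration), and the encoding argument assumes NumPy 1.14 or later, which is newer than the 1.9.2 mentioned in the question.

```python
import io
import numpy as np

# Hypothetical sample data: the X values in the last two rows are filled in
# for illustration, because those rows have only three fields in the question.
sample = (
    "something\t2\t4.0\t0.0\n"
    "something\t4\t8.210526\t0.7254761\n"
    "some_other_thing\t4\t8.4\t0.8081221\n"
    "some_other_thing\t6\t12.52\t1.0544369\n"
)

# dtype=None lets genfromtxt infer a type per column. With mixed string and
# numeric columns the result is a 1-D structured array, not a 2-D float array.
# encoding='utf-8' (assumes NumPy >= 1.14) makes the string column str, not bytes.
my_data = np.genfromtxt(io.StringIO(sample), delimiter='\t',
                        dtype=None, encoding='utf-8')

print(my_data.shape)        # (4,): one record per row, hence no my_data[:, 0]
print(my_data.dtype.names)  # ('f0', 'f1', 'f2', 'f3'): default field names

# Columns are selected by field name instead of by a second index.
groups = my_data['f0']  # 1st column (strings)
X = my_data['f1']       # 2nd column
Y = my_data['f2']       # 3rd column
errors = my_data['f3']  # 4th column (errors)
print(groups, X, Y, errors)
```

Because the array is one-dimensional, my_data[:, 0] asks for two indices where only one exists, which is what raises "too many indices for array".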

score 0 · Answer 2 · answered May 08 '15 at 18:48

I was not able to reproduce the IndexError from your question.

The input you want to support seems to have an uneven number of columns. You can either fix the number of columns, or replace genfromtxt with something like the following, which uses np.asmatrix to maintain the matrix structure no matter how many columns the input data has. This gives me:

In [1827]: paste
my_data = np.asmatrix([line.split() for line in open('input2.txt')])
print('my_data\n' + str(my_data))

groups = my_data[:, 0]  # 1st column
X = my_data[:, 1]  # 2nd column
Y = my_data[:, 2]  # 3rd column
errors = my_data[:, 3]  # 4th column (errors)
print('\ngroups ' + str(groups) + '\nX ' + str(X) + '\nY ' + str(Y) + '\nerrors ' + str(errors))

## -- End pasted text --
my_data
[[['something', '2', '4.0', '0.0']
  ['something', '4', '8.210526', '0.7254761']
  ['some_other_thing', '8.4', '0.8081221']
  ['some_other_thing', '12.52', '1.0544369']]]

groups [[['something', '2', '4.0', '0.0']]]
X [[['something', '4', '8.210526', '0.7254761']]]
Y [[['some_other_thing', '8.4', '0.8081221']]]
errors [[['some_other_thing', '12.52', '1.0544369']]]
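To see which rows have an uneven number of columns before parsing, a small check along these lines may help. This is only a sketch: find_ragged_rows is a hypothetical helper name, and the sample assumes the fields in the question's input are separated by single tabs.

```python
def find_ragged_rows(lines, delimiter='\t'):
    """Return (line_number, field_count) for each row whose field count
    differs from that of the first non-blank row."""
    expected = None
    ragged = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        count = len(line.rstrip('\n').split(delimiter))
        if expected is None:
            expected = count  # the first row defines the expected width
        elif count != expected:
            ragged.append((number, count))
    return ragged


# The input from the question, assuming single tabs between fields.
sample = [
    "something\t2\t4.0\t0.0\n",
    "something\t4\t8.210526\t0.7254761\n",
    "some_other_thing\t8.4\t0.8081221\n",
    "some_other_thing\t12.52\t1.0544369\n",
]

print(find_ragged_rows(sample))  # [(3, 3), (4, 3)]
```

Here lines 3 and 4 have three fields instead of four, which matches the uneven columns noted above.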
