Excluding certain rows while importing data with Numpy

Question

I am generating data sets from experiments. I end up with CSV data sets that are typically n x 4 (n rows, with n > 1000, and 4 columns). However, due to an artifact of the data-collection process, the first couple of rows and the last couple of rows typically have only 2 or 3 columns. So a data set looks like:

8,0,4091
8,0,
8,0,4091,14454
10,0,4099,14454
2,0,4094,14454
8,-3,4104,14455
3,0,4100,14455
....
....
14,-1,4094,14723
0,3,4105,14723
7,0,4123,14723
7,
6,-2,4096,
3,2,

As you can see, the first two rows and the last three don't have the 4 columns that I expect. When I try importing this file using np.loadtxt(filename, delimiter = ','), I get an error. Once I remove the rows which have fewer than 4 columns (first 2 rows, and last 3 rows, in this case), the import works fine.

Two questions:

1) Why doesn't the usual import work? I am not sure what the exact error is. In other words, why is it a problem when the rows don't all have the same number of columns?
2) As a workaround, I know how to ignore the first two rows while importing the file with np.loadtxt(filename, skiprows=2), but is there a simple way to also ignore a fixed number of rows at the bottom?

Note: This is NOT about finding unique rows in a numpy array. It's more about importing csv data that are non-uniform in the number of columns that each row contains.

Possible duplicate of [Find unique rows in numpy.array](http://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array) — Joseph Farah, Jan 18 '17 at 00:16
@JosephFarah This is not about finding unique rows in numpy array. This is about importing csv files with non-uniform structure (rows, columns). I can't even create the numpy array at the moment. — deserthiker, Jan 18 '17 at 01:58

score 2 · Accepted Answer · edited May 23 '17 at 11:45

Your question is similar to (possibly a duplicate of) "Using genfromtxt to import csv data with missing values in numpy".

1) I'm not sure why this is the default behavior. Two possibilities:

It could be to warn users that the CSV file might be corrupt.
It could be because a NumPy array has to be rectangular (N x M), so rows with different numbers of columns cannot be stored in a single 2-D array.
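The rectangular-array point above can be checked with a minimal sketch. The sample rows here are invented for illustration and are read from an in-memory io.StringIO buffer instead of a file. The exact error message differs between NumPy versions, so only the exception type is relied on.

```python
import io

import numpy as np

# Hypothetical sample: the second row has fewer columns than the others.
ragged = "1,2,3,4\n5,6\n7,8,9,10\n"

try:
    np.loadtxt(io.StringIO(ragged), delimiter=",")
    loaded = True
except ValueError as exc:
    # loadtxt cannot build a rectangular array from rows of unequal length.
    loaded = False
    print("loadtxt failed:", exc)

# The same data loads fine once every row has four columns.
uniform = "1,2,3,4\n7,8,9,10\n"
arr = np.loadtxt(io.StringIO(uniform), delimiter=",")
print(arr.shape)
```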

2) Use numpy's genfromtxt. For this you'll need to know the correct number of columns in advance.

import numpy
data = numpy.genfromtxt('data.csv', delimiter=',', usecols=[0,1,2,3], invalid_raise=False)
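As a rough sketch of how this call behaves, the snippet below runs it on a shortened copy of the question's sample held in an in-memory buffer (an assumption made so the example is self-contained; a real file path works the same way). One caveat: a row such as "6,-2,4096," still splits into four fields, the last one empty, so it may be kept with NaN in the last column instead of being skipped. The NaN filter at the end is an extra step, not part of the original answer, and assumes that rows with an empty field should also be dropped.

```python
import io
import warnings

import numpy as np

# Shortened copy of the sample from the question, kept in memory
# instead of in 'data.csv'.
sample = """8,0,4091
8,0,
8,0,4091,14454
10,0,4099,14454
2,0,4094,14454
7,0,4123,14723
7,
6,-2,4096,
3,2,
"""

with warnings.catch_warnings():
    # With invalid_raise=False, genfromtxt warns about the lines it skips
    # instead of raising; the warning is silenced here to keep output short.
    warnings.simplefilter("ignore")
    data = np.genfromtxt(
        io.StringIO(sample),
        delimiter=",",
        usecols=[0, 1, 2, 3],
        invalid_raise=False,
    )

# Assumption: rows with an empty field should be discarded as well.
# Empty fields are read as NaN, so drop every row containing a NaN.
complete = data[~np.isnan(data).any(axis=1)]
print(complete)
```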

Hope this helps!

answered Jan 18 '17 at 00:15

rafaelvalle


This still gives me an error: `ValueError: Some errors were detected ! Line #3 (got 4 columns instead of 3) Line #4 (got 4 columns instead of 3)......` Approach above works. – deserthiker Jan 18 '17 at 02:02
Try it now, I forgot to set invalid_raise to False! – rafaelvalle Jan 18 '17 at 02:09
Nope still doesn't work @rafaelvalle. Still get `ValueError: Some errors were detected ! Line #3 (got 4 columns instead of 3) Line #4 (got 4 columns instead of 3).....` Interestingly at the very end I get :`[[ 8. 0. 4091.] [ 8. 0. nan] [ 8. 0. 4091.] [ 8. 0. nan]] (4, 3)` – deserthiker Jan 18 '17 at 18:20
Please try it again adding the usecols argument. It is a list with the indices of the valid columns. In your example, [0,1,2,3] – rafaelvalle Jan 18 '17 at 18:34

score 1 · Answer 2 · answered Jan 18 '17 at 00:26

You can use genfromtxt, which lets you skip lines at the beginning and at the end:

import numpy as np
np.genfromtxt('array.txt', delimiter=',', skip_header=2, skip_footer=3)
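A small, hedged check of this approach on a shortened copy of the question's sample is sketched below. The data is read from an in-memory buffer instead of 'array.txt', which is an assumption made to keep the example self-contained. It only works when the number of malformed lines at each end is known in advance (2 and 3 here).

```python
import io

import numpy as np

# Shortened copy of the sample from the question: two short rows at the
# top, four complete rows, and three short rows at the bottom.
sample = """8,0,4091
8,0,
8,0,4091,14454
10,0,4099,14454
2,0,4094,14454
7,0,4123,14723
7,
6,-2,4096,
3,2,
"""

data = np.genfromtxt(
    io.StringIO(sample),
    delimiter=",",
    skip_header=2,  # ignore the first two lines
    skip_footer=3,  # ignore the last three lines
)
print(data.shape)
```

If the number of bad lines varies from file to file, the fixed counts would have to be determined per file, or a different filter used.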

Mike Müller
