I am generating data-sets from experiments. I end up with CSV data-sets that are typically n x 4 (n rows, with n > 1000, and 4 columns). However, due to an artifact of the data-collection process, the first couple of rows and the last couple of rows typically have only 2 or 3 columns. So a data-set looks like:

8,0,4091
8,0,
8,0,4091,14454
10,0,4099,14454
2,0,4094,14454
8,-3,4104,14455
3,0,4100,14455
....
....
14,-1,4094,14723
0,3,4105,14723
7,0,4123,14723
7,
6,-2,4096,
3,2,

As you can see, the first two rows and the last three don't have the 4 columns that I expect. When I try importing this file using np.loadtxt(filename, delimiter = ','), I get an error. Once I remove the rows which have fewer than 4 columns (first 2 rows, and last 3 rows, in this case), the import works fine.

Two questions:

  1. Why doesn't the usual import work? I am not sure what the exact error is. In other words, why is it a problem when the rows do not all have the same number of columns?

  2. As a workaround, I know how to ignore the first two rows while importing the file with np.loadtxt(filename, skiprows=2), but is there a simple way to also ignore a fixed number of rows at the bottom?

Note: This is NOT about finding unique rows in a numpy array. It's about importing CSV data where the number of columns varies from row to row.

Lorem Ipsum
deserthiker
  • Possible duplicate of [Find unique rows in numpy.array](http://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array) – Joseph Farah Jan 18 '17 at 00:16
  • @JosephFarah This is not about finding unique rows in numpy array. This is about importing csv files with non-uniform structure (rows, columns). I can't even create the numpy array at the moment. – deserthiker Jan 18 '17 at 01:58

2 Answers

Your question is similar to (and possibly a duplicate of) "Using genfromtxt to import csv data with missing values in numpy".

1) I'm not sure about why this is the default behavior.

  • Could be to warn users that the CSV file might be corrupt.
  • Could be to optimize the array and make it N x M, instead of having multiple column lengths.
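To make the second point concrete: a NumPy array is a rectangular block, so every row has to supply the same number of values. Here is a minimal sketch of the failure, assuming the first four lines of the sample from the question and using io.StringIO as a stand-in for the real file:

```python
import io
import numpy as np

# First rows of the sample: two incomplete rows, then two complete ones.
ragged = io.StringIO("8,0,4091\n8,0,\n8,0,4091,14454\n10,0,4099,14454\n")

try:
    np.loadtxt(ragged, delimiter=",")
    loaded = True
except ValueError as exc:
    # loadtxt cannot build a rectangular float array from these rows.
    loaded = False
    print("loadtxt failed:", exc)
```

The exact wording of the message depends on the NumPy version, so the sketch only relies on the exception type.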

2) Use numpy's genfromtxt. For this you'll need to know the correct number of columns in advance.

import numpy
data = numpy.genfromtxt('data.csv', delimiter=',', usecols=[0,1,2,3], invalid_raise=False)
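A rough, self-contained way to try this call, assuming the sample rows from the question with the elided middle part left out, and io.StringIO standing in for 'data.csv'. One thing to watch for: a row such as "6,-2,4096," still has four fields (the last one is empty), so it is kept with a NaN rather than skipped. The last step, which drops such rows, is an addition for illustration and not part of the call above:

```python
import io
import warnings
import numpy as np

sample = """8,0,4091
8,0,
8,0,4091,14454
10,0,4099,14454
2,0,4094,14454
8,-3,4104,14455
3,0,4100,14455
14,-1,4094,14723
0,3,4105,14723
7,0,4123,14723
7,
6,-2,4096,
3,2,
"""

with warnings.catch_warnings():
    # With invalid_raise=False, genfromtxt warns about skipped lines instead of raising.
    warnings.simplefilter("ignore")
    data = np.genfromtxt(io.StringIO(sample), delimiter=",",
                         usecols=[0, 1, 2, 3], invalid_raise=False)

# Drop rows that were kept but contain a missing (NaN) value.
complete = data[~np.isnan(data).any(axis=1)]
print(data.shape, complete.shape)
```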

Hope this helps!

rafaelvalle
  • This still gives me an error: `ValueError: Some errors were detected ! Line #3 (got 4 columns instead of 3) Line #4 (got 4 columns instead of 3)......` Approach above works. – deserthiker Jan 18 '17 at 02:02
  • Try it now, I forgot to set invalid_raise to False! – rafaelvalle Jan 18 '17 at 02:09
  • Nope still doesn't work @rafaelvalle. Still get `ValueError: Some errors were detected ! Line #3 (got 4 columns instead of 3) Line #4 (got 4 columns instead of 3).....` Interestingly at the very end I get :`[[ 8. 0. 4091.] [ 8. 0. nan] [ 8. 0. 4091.] [ 8. 0. nan]] (4, 3)` – deserthiker Jan 18 '17 at 18:20
  • Please try it again adding the usecols argument. It is a list with the indices of the valid columns. In your example, [0,1,2,3] – rafaelvalle Jan 18 '17 at 18:34

You can use genfromtxt, which lets you skip lines at the beginning and at the end:

import numpy as np
np.genfromtxt('array.txt', delimiter=',', skip_header=2, skip_footer=3)
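A small sketch to check this against the data from the question, assuming the sample rows with the elided middle part left out and io.StringIO standing in for 'array.txt':

```python
import io
import numpy as np

sample = """8,0,4091
8,0,
8,0,4091,14454
10,0,4099,14454
2,0,4094,14454
8,-3,4104,14455
3,0,4100,14455
14,-1,4094,14723
0,3,4105,14723
7,0,4123,14723
7,
6,-2,4096,
3,2,
"""

# Skip the 2 incomplete rows at the top and the 3 incomplete rows at the bottom.
data = np.genfromtxt(io.StringIO(sample), delimiter=",",
                     skip_header=2, skip_footer=3)
print(data.shape)
print(data[0], data[-1])
```

This only works when you know how many bad rows sit at each end; if that count varies between files, filtering on the column count is the safer route.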
Mike Müller