
I have a large CSV file with about 90k rows and 355 columns. The first 354 columns indicate the presence of different words with a 1 or 0, and the last column holds a numerical value.

Eg:

table, box, cups, glasses, total
1,0,0,1,30
0,1,1,1,28
1,1,0,1,55

When I use:

d = np.recfromcsv('clean.csv', dtype=None, delimiter=',', names=True)
d.shape
# I get: (89460,)

So my question is:

  1. How do I get a 2d array/matrix? Does it matter?
  2. How can I separate the 'total' column so I can create train, cross_validation and test sets and train a model?
Brian Tompsett - 汤莱恩
holografix

2 Answers


np.recfromcsv returns a 1-dimensional record array.

When you have a structured array, you can access the columns by field title:

d['total']

returns the totals column.

You can access rows using integer indexing:

d[0]

returns the first row, for example.
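As a minimal sketch of the field and row access described above, the snippet below loads the question's sample rows from an in-memory string (an assumption made so the example runs without a file) using np.genfromtxt with names=True, which also yields a 1-D structured array. The last two lines show one possible way to turn the feature fields into a plain 2D array with numpy.lib.recfunctions.structured_to_unstructured; the variable names are illustrative only.

```python
import io

import numpy as np
from numpy.lib import recfunctions as rfn

# Sample rows from the question, held in memory instead of a file (assumption).
csv_text = "table,box,cups,glasses,total\n1,0,0,1,30\n0,1,1,1,28\n1,1,0,1,55\n"

# names=True takes field names from the header row; the result is a
# 1-D structured array with one record per CSV row.
d = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True,
                  dtype=None, encoding="utf-8")

print(d.shape)        # one entry per row, not (rows, columns)
print(d.dtype.names)  # field names read from the header

totals = d["total"]   # a whole column, selected by field name
first_row = d[0]      # a single record, selected by integer index

# Select every field except the last, then convert to a plain 2D array.
feature_names = list(d.dtype.names[:-1])
X = rfn.structured_to_unstructured(d[feature_names])
print(X.shape)
```

Selecting fields by name works well for one or two columns; with 354 feature columns, loading straight into a plain 2D array is usually the simpler route.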


If you wish to select all the columns except the last one, then you'd be better off using a plain 2D NumPy array. With a plain NumPy array (as opposed to a structured array) you can select all the columns except the last one using integer indexing.

You could use np.genfromtxt to load the data into a 2D array:

import numpy as np

d = np.genfromtxt('data', dtype=None, delimiter=',', skip_header=1)  # skip_header drops the header row
print(d.shape)
# (3, 5)
print(d)
# [[ 1  0  0  1 30]
#  [ 0  1  1  1 28]
#  [ 1  1  0  1 55]]

This selects the last column:

print(d[:,-1])
# [30 28 55]

This selects everything but the last column:

print(d[:,:-1])
# [[1 0 0 1]
#  [0 1 1 1]
#  [1 1 0 1]]
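The same two slices can feed the train, cross-validation and test split the question asks about. The following is only a sketch using plain NumPy: the array d is randomly generated stand-in data, and the 60/20/20 proportions and all variable names are assumptions, not something taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the loaded 2D array: four 0/1 feature columns plus a
# numeric last column playing the role of 'total' (made-up data).
d = np.column_stack([rng.integers(0, 2, size=(10, 4)),
                     rng.integers(20, 60, size=10)])

X = d[:, :-1]  # every column except the last
y = d[:, -1]   # the last column only

# Shuffle the row indices once, then cut them 60/20/20 (assumed ratios).
idx = rng.permutation(len(d))
n_train = int(0.6 * len(d))
n_cv = int(0.2 * len(d))
train_idx = idx[:n_train]
cv_idx = idx[n_train:n_train + n_cv]
test_idx = idx[n_train + n_cv:]

X_train, y_train = X[train_idx], y[train_idx]
X_cv, y_cv = X[cv_idx], y[cv_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_cv.shape, X_test.shape)
```

Indexing X and y with the same index arrays keeps each feature row paired with its target value.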
unutbu
  • Thanks mate I've got that far. The problem is, how do I get all the other columns apart from the last column? – holografix Jan 25 '14 at 01:01
  • Nitpick: All record arrays *returned by `recfromcsv`* are 1-dimensional. They are not all 1-d in general. – Warren Weckesser Jan 25 '14 at 01:03
  • No idea what you meant by that Warren. Can't believe it's this difficult to select a range of columns in numpy! Can't I do something like X = d[:,0:3]; Y = d[:,4] ?! – holografix Jan 25 '14 at 02:00
  • http://stackoverflow.com/questions/16178956/how-can-i-use-numpy-array-indexing-to-select-2-columns-out-of-a-2d-array-to-sele – holografix Jan 25 '14 at 02:11

OK, after much googling and time wasting, this is what anyone who wants to get NumPy out of the way, read a CSV, and get on with scikit-learn needs to do:

import numpy as np

# Say your csv file has 10 columns: 1-9 are features and 10
# is the Y you're trying to predict.
feature_cols = range(0, 9)  # zero-based indices 0-8, i.e. columns 1-9 (excludes Y)
X = np.loadtxt('data.csv', delimiter=',', dtype=float, usecols=feature_cols, ndmin=2, skiprows=1)
Y = np.loadtxt('data.csv', delimiter=',', dtype=float, usecols=(9,), ndmin=2, skiprows=1)
# Note how for Y the usecols argument is given a sequence, even though
# I only want 1 column (newer NumPy versions also accept a single integer).
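An alternative sketch, assuming the whole file fits in memory as floats: read the CSV once with np.loadtxt and slice the resulting 2D array, rather than parsing the file twice. The in-memory sample string stands in for 'data.csv' so the snippet runs on its own.

```python
import io

import numpy as np

# Sample rows from the question, standing in for 'data.csv' (assumption).
csv_text = "table,box,cups,glasses,total\n1,0,0,1,30\n0,1,1,1,28\n1,1,0,1,55\n"

# One pass over the data; skiprows=1 drops the header line.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=float,
                  skiprows=1, ndmin=2)

X = data[:, :-1]  # all feature columns
Y = data[:, -1:]  # last column, kept 2D with shape (n_rows, 1)

print(X.shape, Y.shape)
```

Using the slice -1: rather than -1 keeps Y two-dimensional, matching what ndmin=2 produces above; Y.ravel() gives a flat vector if one is needed.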
holografix