87

Given this CSV file:

"A","B","C","D","E","F","timestamp"
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291111964948E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291113113366E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291120650486E12

I simply want to load it as a matrix/ndarray with 3 rows and 7 columns. However, for some reason, all I can get out of numpy is an ndarray with 3 rows (one per line) and no columns.

r = np.genfromtxt(fname,delimiter=',',dtype=None, names=True)
print r
print r.shape

[ (611.88243, 9089.5601000000006, 5133.0, 864.07514000000003, 1715.3747599999999, 765.22776999999996, 1291111964948.0)
 (611.88243, 9089.5601000000006, 5133.0, 864.07514000000003, 1715.3747599999999, 765.22776999999996, 1291113113366.0)
 (611.88243, 9089.5601000000006, 5133.0, 864.07514000000003, 1715.3747599999999, 765.22776999999996, 1291120650486.0)]
(3,)

I can manually iterate and hack it into the shape I want, but this seems silly. I just want to load it as a proper matrix so I can slice it across different dimensions and plot it, just like in matlab.

Joshua Taylor
  • 84,998
  • 9
  • 154
  • 353
dgorissen
  • 6,207
  • 3
  • 43
  • 52

3 Answers3

167

Pure numpy

numpy.loadtxt(open("test.csv", "rb"), delimiter=",", skiprows=1)

Check out the loadtxt documentation.

You can also use python's csv module:

import csv
import numpy
reader = csv.reader(open("test.csv", "rb"), delimiter=",")
x = list(reader)
result = numpy.array(x).astype("float")

You will have to convert it to your favorite numeric type. I guess you can write the whole thing in one line:

result = numpy.array(list(csv.reader(open("test.csv", "rb"), delimiter=","))).astype("float")

Added Hint:

You could also use pandas.io.parsers.read_csv and get the associated numpy array which can be faster.

Augustin
  • 2,444
  • 23
  • 24
Kaveh_kh
  • 1,852
  • 1
  • 12
  • 14
  • I would add that the skiprows=1 flag is skipping the first row, and is not a standard activation flag if you want to keep all the data. Worked perfectly! – Arturo Jan 02 '17 at 21:09
  • loadtxt does not load also the column names which happen with names=True on genfromtxt – mhstnsc Oct 26 '17 at 12:06
  • Can I ask - is `open` local to that single line? As in, does the file close at the end of the line? – Daniel Soutar Mar 22 '18 at 00:14
  • Yes, it closes the file. See also: https://stackoverflow.com/questions/8011797/open-read-and-close-a-file-in-1-line-of-code – Kaveh_kh Mar 22 '18 at 18:05
  • I would suggest using the seocnd method since `loadtxt` is awfully slow. Alternatively `pandas` is pretty great for the purpose – fireball.1 Feb 13 '19 at 12:02
  • 1
    @fireball.1 speed tests for claims like this would be great for posterity – Akaisteph7 May 30 '20 at 17:27
6

I think using dtype where there is a name row is confusing the routine. Try

>>> r = np.genfromtxt(fname, delimiter=',', names=True)
>>> r
array([[  6.11882430e+02,   9.08956010e+03,   5.13300000e+03,
          8.64075140e+02,   1.71537476e+03,   7.65227770e+02,
          1.29111196e+12],
       [  6.11882430e+02,   9.08956010e+03,   5.13300000e+03,
          8.64075140e+02,   1.71537476e+03,   7.65227770e+02,
          1.29111311e+12],
       [  6.11882430e+02,   9.08956010e+03,   5.13300000e+03,
          8.64075140e+02,   1.71537476e+03,   7.65227770e+02,
          1.29112065e+12]])
>>> r[:,0]    # Slice 0'th column
array([ 611.88243,  611.88243,  611.88243])
mtrw
  • 34,200
  • 7
  • 63
  • 71
  • Interestingly, this does not change the result in my case. I am using Python 2.5 and numpy 1.4.1 so maybe that is the problem – dgorissen Nov 30 '10 at 16:55
  • I'm using Python 2.6 and NumPy 1.3.0! I like the older behavior better. – mtrw Nov 30 '10 at 17:14
4

You can read a CSV file with headers into a NumPy structured array with np.genfromtxt. For example:

import numpy as np

csv_fname = 'file.csv'
with open(csv_fname, 'w') as fp:
    fp.write("""\
"A","B","C","D","E","F","timestamp"
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291111964948E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291113113366E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291120650486E12
""")

# Read the CSV file into a Numpy record array
r = np.genfromtxt(csv_fname, delimiter=',', names=True, case_sensitive=True)
print(repr(r))

which looks like this:

array([(611.88243, 9089.5601, 5133., 864.07514, 1715.37476, 765.22777, 1.29111196e+12),
       (611.88243, 9089.5601, 5133., 864.07514, 1715.37476, 765.22777, 1.29111311e+12),
       (611.88243, 9089.5601, 5133., 864.07514, 1715.37476, 765.22777, 1.29112065e+12)],
      dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8'), ('D', '<f8'), ('E', '<f8'), ('F', '<f8'), ('timestamp', '<f8')])

You can access a named column like this r['E']:

array([1715.37476, 1715.37476, 1715.37476])

Note: this answer previously used np.recfromcsv to read the data into a NumPy record array. While there was nothing wrong with that method, structured arrays are generally better than record arrays for speed and compatibility.

Mike T
  • 41,085
  • 18
  • 152
  • 203