560

Is there a direct way to import the contents of a CSV file into a record array, just like how R's read.table(), read.delim(), and read.csv() import data into R dataframes?

Or should I use csv.reader() and then apply numpy.core.records.fromrecords()?

hatmatrix

14 Answers

860

Use numpy.genfromtxt() by setting the delimiter kwarg to a comma:

from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
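
A sketch of some common variations (mixed column types, a header row, non-ASCII text); the file name and options here are only illustrative:

import numpy as np

# dtype=None lets genfromtxt infer each column's type (giving a record array
# when the columns differ), names=True takes field names from the header row,
# and encoding='utf8' avoids decode errors on non-ASCII input.
my_data = np.genfromtxt('my_file.csv', delimiter=',', dtype=None,
                        names=True, encoding='utf8')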
Andrew
  • What if you want something of different types? Like strings and ints? – CGTheLegend Mar 21 '17 at 02:20
  • @CGTheLegend np.genfromtxt('myfile.csv', delimiter=',', dtype=None) – chickensoup Apr 26 '17 at 02:45
  • [numpy.loadtxt](https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html) worked pretty well for me too – Yibo Yang May 19 '17 at 17:34
  • I tried this but I am only getting `nan` values, why? Also with loadtxt, I am getting `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 155: ordinal not in range(128)`. I have umlauts such as ä and ö in the input data. – hhh Jun 18 '17 at 12:00
  • @hhh try adding the `encoding="utf8"` argument. Python is one of the few modern pieces of software that frequently causes text-encoding problems, which feel like a thing of the past. – kolen Sep 24 '18 at 22:34
  • Use skip_header=1 to skip the first line if you're reading from a file with a header. https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt – Liam Nov 22 '19 at 15:20
  • The problem with this method is that if you have only one row of data, then instead of a column list with one item you will get the item directly. Therefore, this solution is not generally usable; it needs to be handled differently depending on the number of records in the CSV file. – Dávid Natingga Jan 31 '20 at 15:32
  • @kolen It is a Windows problem. On Linux, encoding='utf8' has been the default for quite a long time in the majority of programs, including Python. – Antony Hatchkins Dec 31 '20 at 19:15
241

Use pandas.read_csv:

import pandas as pd
df = pd.read_csv('myfile.csv', sep=',', header=None)
df.values
array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

This gives a pandas DataFrame which provides many useful data manipulation functions which are not directly available with numpy record arrays.

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table...
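
If a record array specifically is wanted (as in the question), the DataFrame can also be converted with to_records(); a quick sketch, where index=False drops the DataFrame's row index:

import pandas as pd

df = pd.read_csv('myfile.csv', sep=',', header=None)
rec = df.to_records(index=False)   # NumPy record array, one field per column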


I would also recommend numpy.genfromtxt. However, since the question asks for a record array, as opposed to a normal array, the dtype=None parameter needs to be added to the genfromtxt call:

import numpy as np
np.genfromtxt('myfile.csv', delimiter=',')

For the following 'myfile.csv':

1.0, 2, 3
4, 5.5, 6

the code above gives an array:

array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

and

np.genfromtxt('myfile.csv', delimiter=',', dtype=None)

gives a record array:

array([(1.0, 2.0, 3), (4.0, 5.5, 6)], 
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<i4')])

This has the advantage that files with multiple data types (including strings) can be easily imported.
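
The fields of the resulting record array can then be accessed by name; a quick sketch using the same 'myfile.csv' (the names f0, f1, ... are auto-generated because there is no header row):

import numpy as np

data = np.genfromtxt('myfile.csv', delimiter=',', dtype=None)
print(data['f0'])   # first column as an array, e.g. [1. 4.]
print(data[0])      # first row as a single record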

Lee
  • read_csv works with commas inside quotes. Recommend this over genfromtxt – Viet Apr 06 '16 at 21:37
  • use header=0 to skip the first line in the values, if your file has a 1-line header – c-chavez Jun 30 '17 at 13:34
  • Bear in mind that this creates a 2d array: e.g. `(1000, 1)`. `np.genfromtxt` does not do that: e.g. `(1000,)`. – Newskooler May 12 '20 at 18:38
  • The OP is asking for Numpy arrays, not about Pandas Dataframe objects. – José L. Patiño Apr 06 '23 at 13:43
  • @JoséL.Patiño The second part of the question deals with the request for a [Numpy record array](https://numpy.org/doc/stable/user/basics.rec.html). The first part of the answer shows `df.values` which gives a [Numpy representation of the DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.values.html); a convenient method imho. – Lee Apr 06 '23 at 17:36
92

I tried it:

from numpy import genfromtxt
genfromtxt(fname = dest_file, dtype = (<whatever options>))

versus :

import csv
import numpy as np
with open(dest_file, 'r') as dest_f:
    data_iter = csv.reader(dest_f,
                           delimiter=delimiter,
                           quotechar='"')
    data = [row for row in data_iter]
data_array = np.asarray(data, dtype=<whatever options>)

on 4.6 million rows with about 70 columns and found that the NumPy path took 2 min 16 secs and the csv-list comprehension method took 13 seconds.

I would recommend the csv-list comprehension method, as it most likely relies on pre-compiled libraries and doesn't lean on the interpreter as much as NumPy does. I suspect the pandas method would have similar interpreter overhead.

William komp
  • I tested code similar to this with a csv file containing 2.6 million rows and 8 columns. numpy.recfromcsv() took about 45 seconds, np.asarray(list(csv.reader())) took about 7 seconds, and pandas.read_csv() took about 2 seconds (!). (The file had recently been read from disk in all cases, so it was already in the operating system's file cache.) I think I'll go with pandas. – Matthias Fripp Mar 31 '16 at 21:56
  • I just noticed there are some notes about the design of pandas' fast csv parser at http://wesmckinney.com/blog/a-new-high-performance-memory-efficient-file-parser-engine-for-pandas/ . The author takes speed and memory requirements pretty seriously. It's also possible to use as_recarray=True to get the result directly as a Python record array rather than a pandas dataframe. – Matthias Fripp Apr 05 '16 at 19:20
70

You can also try recfromcsv() which can guess data types and return a properly formatted record array.
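
A minimal sketch, assuming a comma-separated file with a header row and a NumPy version that still ships recfromcsv():

import numpy as np

# recfromcsv() infers the column dtypes and takes field names from the header row.
data = np.recfromcsv('myfile.csv', delimiter=',')
print(data.dtype.names)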

btel
  • If you want to maintain ordering / column names in the CSV, you can use the following invocation: `numpy.recfromcsv(fname, delimiter=',', filling_values=numpy.nan, case_sensitive=True, deletechars='', replace_space=' ')` The key arguments are the last three. – eacousineau Oct 17 '13 at 14:00
26

Having tried both NumPy and pandas, I found that using pandas has a lot of advantages:

  • Faster
  • Less CPU usage
  • 1/3 RAM usage compared to NumPy genfromtxt

This is my test code:

$ for f in test_pandas.py test_numpy_csv.py ; do  /usr/bin/time python $f; done
2.94user 0.41system 0:03.05elapsed 109%CPU (0avgtext+0avgdata 502068maxresident)k
0inputs+24outputs (0major+107147minor)pagefaults 0swaps

23.29user 0.72system 0:23.72elapsed 101%CPU (0avgtext+0avgdata 1680888maxresident)k
0inputs+0outputs (0major+416145minor)pagefaults 0swaps

test_numpy_csv.py

from numpy import genfromtxt
train = genfromtxt('/home/hvn/me/notebook/train.csv', delimiter=',')

test_pandas.py

from pandas import read_csv
df = read_csv('/home/hvn/me/notebook/train.csv')

Data file:

du -h ~/me/notebook/train.csv
 59M    /home/hvn/me/notebook/train.csv

With NumPy and pandas at versions:

$ pip freeze | egrep -i 'pandas|numpy'
numpy==1.13.3
pandas==0.20.2
HVNSweeting
8

Using numpy.loadtxt

A quite simple method, but it requires all the elements to be numeric (float, int, and so on):

import numpy as np
data = np.loadtxt('c:\\1.csv', delimiter=',', skiprows=0)
Xiaojian Chen
7

You can use this code to load CSV file data into an array:

import numpy as np
csv = np.genfromtxt('test.csv', delimiter=",")
print(csv)
chamzz.dot
7

This works like a charm:

import csv
import numpy as np

with open("data.csv", 'r') as f:
    data = list(csv.reader(f, delimiter=";"))

# np.float was removed in recent NumPy versions; use the built-in float instead.
data = np.array(data, dtype=float)
Butiri Dan
6

This is the easiest way:

import csv
with open('testfile.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))

Now each entry in data is a row of the file, represented as a list of strings, so you have a 2D list. It saved me so much time.
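
If a numeric NumPy array is needed rather than nested lists of strings, one extra conversion step does it (a sketch, assuming every field is numeric):

import numpy as np

# csv.reader yields strings, so convert explicitly to get a numeric array.
arr = np.array(data, dtype=float)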

matthewpark319
  • Why should we have to screw around with Pandas, when these tools have so much less feature bloat? – Chris Jan 07 '20 at 00:50
6

I would suggest using tables (pip3 install tables). You can save your .csv file as .h5 using pandas (pip3 install pandas):

import pandas as pd
data = pd.read_csv("dataset.csv")
store = pd.HDFStore('dataset.h5')
store['mydata'] = data
store.close()

You can then load your data into a NumPy array quickly, even for huge amounts of data:

import pandas as pd
store = pd.HDFStore('dataset.h5')
data = store['mydata']
store.close()

# Data in NumPy format
data = data.values
Jatin Mandav
6

This works with recent pandas and NumPy versions (DataFrame.to_numpy() was added in pandas 0.24):

import pandas as pd
import numpy as np

data = pd.read_csv('data.csv', header=None)

# Discover, visualize, and preprocess data using pandas if needed.

data = data.to_numpy()
4

I tried this:

import pandas as pd

closingValue = pd.read_csv("<FILENAME>", usecols=[4], dtype=float)
print(closingValue)
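
As with the other pandas-based answers, read_csv returns a DataFrame rather than an array; a quick follow-up sketch to get a plain NumPy array from it:

closing_array = closingValue.to_numpy()   # use closingValue.values on older pandas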
muTheTechie
0
In [329]: %time my_data = genfromtxt('one.csv', delimiter=',')
CPU times: user 19.8 s, sys: 4.58 s, total: 24.4 s
Wall time: 24.4 s

In [330]: %time df = pd.read_csv("one.csv", skiprows=20)
CPU times: user 1.06 s, sys: 312 ms, total: 1.38 s
Wall time: 1.38 s
-1

This is a very simple task; the best way to do it is as follows:

import pandas as pd
import numpy as np

# Read the file. Put 'r' before the path string to handle special characters
# in the path, such as '\'. Don't forget the ".csv" at the end of the file name.
df = pd.read_csv(r'C:\Users\Ron\Desktop\Clients.csv')

print(df)

y = np.array(df)
  • The OP asked to read directly into a `numpy` array. Reading it as a `dataframe` and converting it to a `numpy` array requires more storage and time. – user3503711 Aug 03 '22 at 21:21
  • Yes, that's correct. But I just gave another possible way of doing the same thing, in case the above doesn't work. – Ovu Sunday Aug 04 '22 at 11:23