
I have a large CSV file (13 GB) that I wish to read into a DataFrame in Python. So I use:

txt = pd.read_csv(r'...file.csv', sep=';', encoding="UTF-8", iterator = True, chunksize=1000)

It works just fine, but the data ends up in a pandas.io.parsers.TextFileReader object, and I want it in a DataFrame so I can manipulate the data.

I manage to get a sample of the data as a DataFrame using:

txt.get_chunk(300)

But I would like to have all of the data inside a dataframe. So, I tried:

for df1 in txt:
    df.append(df1)

I also tried:

df2 = pd.concat([chunk for chunk in txt])

Didn't work either. Can someone please help me?

Thanks in advance!

Siva Kg
  • Do you want to load the whole 13 GB file into a single DataFrame variable? – Bharath_Raja Jan 15 '20 at 17:21
  • Just get rid of the `chunksize` argument, then `txt` will be a DataFrame. The `chunksize` argument is appropriate when you can't fit everything in memory and instead need to process more manageable pieces one at a time. – ALollz Jan 15 '20 at 19:40
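
For reference, a minimal sketch of both approaches the comment describes; the file path and read_csv options mirror the question, while the 'amount' column used in the per-chunk aggregation is purely hypothetical:

import pandas as pd

# Option 1: drop chunksize entirely -> read_csv returns a DataFrame directly
# (only viable if the whole 13 GB fits in memory)
df = pd.read_csv(r'...file.csv', sep=';', encoding="UTF-8")

# Option 2: keep chunksize and process each chunk as it is read,
# keeping only a small result per chunk ('amount' is a hypothetical column)
reader = pd.read_csv(r'...file.csv', sep=';', encoding="UTF-8", chunksize=1000)
total = sum(chunk['amount'].sum() for chunk in reader)
print(total)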

2 Answers


You can read part of the data into a variable by using the `nrows` parameter while reading the file.

txt = pd.read_csv(r'...file.csv', sep=';', encoding="UTF-8", nrows=1000)

However, for data this large you should prefer a machine with more memory, or distribute the work across multiple machines by setting up dask.
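
If you go this route and need more than the first block of rows, one way is to page through the file by combining `nrows` with `skiprows`; a rough sketch (see the comment below for why `chunksize` is usually the better tool):

import pandas as pd

# Read 1000 rows at a time by skipping what was already read.
# Every call re-parses the file from the top, which is the weakness
# pointed out in the comment below.
chunk_size = 1000
start = 0
while True:
    part = pd.read_csv(r'...file.csv', sep=';', encoding="UTF-8",
                       skiprows=range(1, start + 1), nrows=chunk_size)
    if part.empty:
        break
    # ... process `part` here ...
    start += chunk_size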

Bharath_Raja
  • This is far worse than the `chunksize` option because to get the next 1000 rows you need to re-read the entire file to find your place. The `chunksize` argument is a lot smarter, essentially giving you a generator that you exhaust. – ALollz Jan 15 '20 at 19:42

Take a look at this answer; in particular, dask's read_csv could do the trick.
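
A minimal sketch of what that could look like, assuming dask is installed; dask.dataframe reads the CSV lazily in partitions, and the 'category' / 'amount' column names here are purely hypothetical:

import dask.dataframe as dd

# dask splits the CSV into partitions and only loads them when needed
ddf = dd.read_csv(r'...file.csv', sep=';', encoding="UTF-8")

# operations are lazy; .compute() triggers the actual work
result = ddf.groupby('category')['amount'].sum().compute()
print(result)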

Pierluigi