
I have a large CSV file (13 GB) that I wish to read into a DataFrame in Python. So I use:

txt = pd.read_csv(r'...file.csv', sep=';', encoding="UTF-8", iterator = True, chunksize=1000)

It works just fine, but the data ends up in a pandas.io.parsers.TextFileReader object, and I want it in a DataFrame so I can manipulate the data.

I manage to get a sample of the data as a DataFrame using:

txt.get_chunk(300)

But I would like to have all of the data inside a dataframe. So, I tried:

for df1 in txt:
    df.append(df1)

I also tried:

df2 = pd.concat([chunk for chunk in txt])

Didn't work either. Can someone please help me?

Thanks in advance!

Siva Kg
  • Do you want to load the whole 13 GB file into a single DataFrame variable? – Bharath_Raja Jan 15 '20 at 17:21
  • Just get rid of the `chunksize` argument, then `txt` will be a DataFrame. The `chunksize` argument is appropriate when you can't fit everything in memory and instead need to process more manageable pieces one at a time. – ALollz Jan 15 '20 at 19:40
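
For reference, a minimal sketch of both approaches the comment describes; the file path and read_csv options mirror the question, while the 'amount' column used in the per-chunk aggregation is purely hypothetical:

import pandas as pd

# Option 1: drop chunksize entirely -> read_csv returns a DataFrame directly
# (only viable if the whole 13 GB fits in memory)
df = pd.read_csv(r'...file.csv', sep=';', encoding="UTF-8")

# Option 2: keep chunksize and process each chunk as it is read,
# keeping only a small result per chunk ('amount' is a hypothetical column)
reader = pd.read_csv(r'...file.csv', sep=';', encoding="UTF-8", chunksize=1000)
total = sum(chunk['amount'].sum() for chunk in reader)
print(total)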

2 Answers


You can read part of the data into a variable by using the `nrows` parameter while reading the file.

txt = pd.read_csv(r'...file.csv', sep=';', encoding="UTF-8", nrows=1000)

However, for data this large you should prefer a machine with more memory, or distribute the work across multiple machines by setting up dask.
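
If you go this route and need more than the first block of rows, one way is to page through the file by combining `nrows` with `skiprows`; a rough sketch (see the comment below for why `chunksize` is usually the better tool):

import pandas as pd

# Read 1000 rows at a time by skipping what was already read.
# Every call re-parses the file from the top, which is the weakness
# pointed out in the comment below.
chunk_size = 1000
start = 0
while True:
    part = pd.read_csv(r'...file.csv', sep=';', encoding="UTF-8",
                       skiprows=range(1, start + 1), nrows=chunk_size)
    if part.empty:
        break
    # ... process `part` here ...
    start += chunk_size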

Bharath_Raja
  • This is far worse than the `chunksize` option because to get the next 1000 rows you need to re-read the entire file to find your place. The `chunksize` argument is a lot smarter, essentially giving you a generator that you exhaust. – ALollz Jan 15 '20 at 19:42

Take a look at this answer; in particular, dask's read_csv could do the trick.
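
A minimal sketch of what that could look like, assuming dask is installed; dask.dataframe reads the CSV lazily in partitions, and the 'category' / 'amount' column names here are purely hypothetical:

import dask.dataframe as dd

# dask splits the CSV into partitions and only loads them when needed
ddf = dd.read_csv(r'...file.csv', sep=';', encoding="UTF-8")

# operations are lazy; .compute() triggers the actual work
result = ddf.groupby('category')['amount'].sum().compute()
print(result)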

Pierluigi