
I am new to Python and am attempting to read a large .csv file (with hundreds of thousands or possibly a few million rows, and about 15,000 columns) using pandas.

What I thought I could do is create and save each chunk to a new .csv file, iterating across all chunks. I am currently using a laptop with relatively limited memory (about 4 GB, in the process of upgrading it), but I was wondering whether I could do this without changing my setup now. Alternatively, I could move this process to a PC with more RAM and attempt larger chunks, but I wanted to get this in place even for smaller row chunks.

I have seen that I can quickly process chunks of data (e.g. 10,000 rows and all columns) using the code below. But being a Python beginner, I have only managed to handle the first chunk. I would like to loop iteratively across all chunks and save each of them.

import pandas as pd
import os

# Show where the script runs from and which files are available
print(os.getcwd())
print(os.listdir(os.getcwd()))

chunksize = 10000

# read_csv with chunksize returns an iterator (TextFileReader), not a DataFrame
data = pd.read_csv('ukb35190.csv', chunksize=chunksize)

# get_chunk retrieves only the next (here: the first) chunk as a DataFrame
df = data.get_chunk(chunksize)

print(df)

# Write the first chunk to a new .csv file
export_csv1 = df.to_csv(r'/home/user/PycharmProjects/PROJECT/export_csv_1.csv', index=None, header=True)
  • Possible duplicate of [How to read a 6 GB csv file with pandas](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas) – Yuca Sep 04 '19 at 17:58
  • Possible duplicate of [this question](https://stackoverflow.com/questions/48007017/pandas-split-csv-into-multiple-csvs-or-dataframes-by-a-column) – Aditya Mishra Sep 04 '19 at 17:59
  • Are you doing any processing in pandas or are you just using it for splitting? Does the original file have headers? – Steven Rumbalski Sep 04 '19 at 18:00
  • Thank you Steven. I am just using it for splitting here (in this script). The original file has headers which I found a way to attach in every new .csv file. I still can't figure out a way to save my large .csv file into smaller new .csv files though. Most of the previous questions address how to process but not how to save every step, into a new .csv. Any ideas would be much appreciated. – Leonardo Sep 24 '19 at 11:39
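
Regarding the header comment above: pandas' `DataFrame.to_csv` writes the column header row by default, so each split file carries the headers automatically. As a minimal sketch (file names here are placeholders, not from the question), this is how one could instead keep the header only in the first split file, if that were ever preferred:

    import pandas as pd

    # Placeholder input/output names, for illustration only
    for batch_no, chunk in enumerate(pd.read_csv('big_file.csv', chunksize=10000), start=1):
        # header=True only for the first batch; later files get data rows only
        chunk.to_csv('split_' + str(batch_no) + '.csv', index=False, header=(batch_no == 1))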

1 Answer


If you are not doing any processing on the data, then you don't even have to store it in a variable. You can do it directly. See the code below. Hope this helps.

import pandas as pd

chunksize = 10000
batch_no = 1

# Stream the file in chunks and write each chunk straight out to its own .csv
for chunk in pd.read_csv(r'ukb35190.csv', chunksize=chunksize):
    chunk.to_csv(r'ukb35190_' + str(batch_no) + '.csv', index=False)
    batch_no += 1
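
As a quick sanity check (a sketch; the file name `ukb35190_1.csv` assumes the naming used in the loop above), the first output file can be read back to confirm it has the expected number of rows and that the header row was preserved, since `to_csv` writes headers by default:

    import pandas as pd

    # Read back the first split file produced by the loop above (name assumed)
    check = pd.read_csv('ukb35190_1.csv')

    print(check.shape)        # should be (10000, number_of_columns), except possibly the last file
    print(check.columns[:5])  # headers are written to every file by default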