The data file is too big for RAM, so I can't use .read_csv() -> concat -> .to_csv(). Is there an easy option to concat two DataFrames?


asked by Sirdrakeistda
- have you considered using generators? https://stackoverflow.com/questions/18915941/create-a-pandas-dataframe-from-generator – Ade_1 May 30 '21 at 12:25
- read_csv() has parameters like `iterator` and `chunksize` to help with reading in big files. Check that out (see the sketch after these comments) – Shubham Periwal May 30 '21 at 12:41
- If you just need to append files, you can read in individual files and append them all together using `mode="a"` for `to_csv` (also sketched below) – Alex May 30 '21 at 12:42
- If your dataset exceeds memory, you should try [Dask](https://dask.org/), which lets you work with large datasets for data manipulation and works well with Python libraries like NumPy, scikit-learn, etc. (see the Dask sketch below). More info: [Dask and pandas: There’s No Such Thing as Too Much Data](https://towardsdatascience.com/dask-and-pandas-theres-no-such-thing-as-too-much-data-284763348459) – DarrylG May 30 '21 at 12:52
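
Building on the two comments above about `chunksize` and `mode="a"`, here is a minimal sketch of a chunked read-and-append loop; the filenames, output name, and chunk size are placeholders rather than anything from the question:

import pandas as pd

filenames = ['file1.csv', 'file2.csv', 'file3.csv']  # placeholder input files
chunk_size = 100_000  # rows held in memory at a time; tune to your RAM

header_written = False
for filename in filenames:
    # chunksize makes read_csv yield DataFrames of at most chunk_size rows
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        chunk.to_csv('combined.csv', mode='a', index=False, header=not header_written)
        header_written = True

Only one chunk is held in memory at a time, and only the very first chunk writes the header row.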
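The Dask suggestion in the last comment could look roughly like this; the glob pattern and output name are assumptions, and it requires the dask dataframe package to be installed:

import dask.dataframe as dd

# Dask reads the CSVs lazily, in partitions small enough to fit in memory
ddf = dd.read_csv('file*.csv')
ddf.to_csv('combined.csv', single_file=True, index=False)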
1 Answer
The idea is to read a batch of n rows (within RAM limits) at a time from each csv file and write/append it to a new csv file. Note that all files must have the same column schema.
The code below seems to work on my small csv files. You could try it on larger ones with a larger batch size, and let me know if it works.
import pandas as pd

filenames = ['file1.csv', 'file2.csv', 'file3.csv']
batch_size = 2  # try a much larger value (e.g. 100_000) for real files

# save the header once, taking the column names from the first file
df = pd.read_csv(filenames[0], nrows=0)
df.to_csv('new.csv', index=False)

for filename in filenames:
    this_batch = batch_size
    i = 0
    while this_batch == batch_size:  # stop after a partial (or empty) batch
        # for i > 0, skiprows also skips the header line; the row at position
        # batch_size*i is consumed as the header, but it was already written
        # in the previous batch, so no rows are lost or duplicated
        df = pd.read_csv(filename, nrows=batch_size, skiprows=batch_size * i)
        this_batch = len(df)
        i += 1
        df.to_csv('new.csv', mode='a', index=False, header=None)
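
As a quick sanity check (not part of the original answer), you could compare the number of data rows written to new.csv against the inputs:

# count data rows (excluding the single header line) in each file
total_in = sum(sum(1 for _ in open(f)) - 1 for f in filenames)
total_out = sum(1 for _ in open('new.csv')) - 1
assert total_in == total_out, 'row counts differ'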

answered by blackraven