
The data file is too big for RAM, so I can't use .read_csv() -> concat -> .to_csv(). Is there an easy option to concat two DataFrames?

blackraven
  • have you considered using generators? https://stackoverflow.com/questions/18915941/create-a-pandas-dataframe-from-generator – Ade_1 May 30 '21 at 12:25
  • read_csv() has parameters like `iterator` and `chunksize` to help with reading in big files. Check that out – Shubham Periwal May 30 '21 at 12:41
  • If you just need to append files, you can read in individual files and append them all together using `mode="a"` for `to_csv` (see the sketch after these comments) – Alex May 30 '21 at 12:42
  • If your dataset exceeds memory, you should try [DASK](https://dask.org/), which lets you work with larger-than-memory datasets for data manipulation and works well with Python libraries like NumPy, scikit-learn, etc. (see the Dask sketch below). More info: [Dask and pandas: There’s No Such Thing as Too Much Data](https://towardsdatascience.com/dask-and-pandas-theres-no-such-thing-as-too-much-data-284763348459) – DarrylG May 30 '21 at 12:52
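
A minimal sketch of the chunked read-and-append approach suggested in the comments above; the file names, output name, and chunk size are placeholders, so adjust them to your data and available RAM:

import pandas as pd

filenames = ['file1.csv', 'file2.csv', 'file3.csv']

# Write the header once, taken from the first file's columns
pd.read_csv(filenames[0], nrows=0).to_csv('combined.csv', index=False)

for filename in filenames:
    # chunksize makes read_csv return an iterator of DataFrames,
    # so only one chunk is held in memory at a time
    for chunk in pd.read_csv(filename, chunksize=100_000):
        chunk.to_csv('combined.csv', mode='a', index=False, header=False)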
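
If you want to try the Dask route instead, here is a rough sketch, assuming the files share the same column schema (single_file=True asks Dask to write one output csv rather than one file per partition):

import dask.dataframe as dd

# Dask reads the files lazily and processes them partition by partition
ddf = dd.read_csv(['file1.csv', 'file2.csv', 'file3.csv'])
ddf.to_csv('combined.csv', single_file=True, index=False)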

1 Answer


My idea is to read a batch of n rows (within RAM limits) from each csv file and write/append it to a new csv file. Note that all the files must have the same column schema.

The code below seems to work on my small csv files. You could try it on larger ones with a larger batch size and let me know if it works.

import pandas as pd

filenames = ['file1.csv', 'file2.csv', 'file3.csv']
batch_size = 2

# Write the header row once, taken from the first file
df = pd.read_csv(filenames[0], nrows=0)
df.to_csv('new.csv', index=False)

for filename in filenames:
    this_batch = batch_size
    i = 0
    while this_batch == batch_size:
        # Keep line 0 (the header) and skip the data rows already copied
        df = pd.read_csv(filename, nrows=batch_size,
                         skiprows=range(1, batch_size * i + 1))
        this_batch = len(df)
        i += 1
        df.to_csv('new.csv', mode='a', index=False, header=None)
blackraven