1

I have a very large number of csv files that I need to merge into a single one. I can't make a list and concatenate later because of memory restrictions even tho I have 64 GB of RAM.

To avoid saving everything into memory, I'm streaming the data into a file using:

entidad_csv = folder_entidad / f"{entidad.name.lower()}.csv"
for f in tqdm(files):
    df = descomprime(f)
    df.to_csv(entidad_csv, index=False, mode="a", header=not entidad_csv.exists())

But I'm having problems when a column is missing in one of the files because it appends the rows as is. I don't know before hand which columns are present in each file so I need the merged file to have every column.

Thanks in advance

marco
  • 107
  • 1
  • 12

2 Answers2

1

You can try making a list of columns based on files read and make it a blank dataframe with header only and then use the blank dataframe to concat with new csv file. It will make a new dataframe with all columns from files.

Dinh Quang Tuan
  • 468
  • 3
  • 10
1

IIUC, you can use the logic below that uses :

  • Index.union : to form a unique combination between the headers
  • insert_line_front : to replace/overwrite the header at each iteration
  • reindex : to produce NaN values below the missing col(s)
output_file = Path("output_folder") / "merged.csv"

for file_path in tqdm(Path("input_files").glob("*.csv")):
    df = pd.read_csv(file_path)
    if not output_file.exists():
        df.to_csv(output_file, index=False) # header=True
    else:
        with open(output_file, "r") as file:
            old_header = file.readline().strip()
        if old_header:
            new_header = pd.Index(old_header.split(",")).union(df.columns, sort=False)
            #https://stackoverflow.com/a/48734140/16120011 © Stephen Rauch
            insert_line_front(output_file, new_header.str.cat(sep=","))
            df.reindex(new_header, axis=1).to_csv(
                output_file, index=False, header=False, mode="a"
            )

Test/Output :

file1.csv:

col1
foo
bar
qux

file2.csv:

col3,col4
uu,3
ww,4

file3.csv:

col5,col2,col1
ii,5,baz
jj,6,qux

merged.csv (in tabular format) :

col1 col3 col4 col5 col2
foo
bar
qux
uu 3.0
ww 4.0
baz ii 5.0
qux jj 6.0

Tree view :

input_files
┣━━ file1.csv
┣━━ file1.csv
┗━━ file3.csv
output_folder
┗━━ merged.csv
Timeless
  • 22,580
  • 4
  • 12
  • 30
  • Nice! logic is sound and this might be a solution but its giving me a UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4935: character maps to that Im trying to figure out – marco May 27 '23 at 06:47
  • In which line the error is thrown ? Anyways, you might play with the encodings to get the right one, check [this](https://stackoverflow.com/a/65596858/16120011) answer that addresses a similar error.. – Timeless May 27 '23 at 07:02
  • I figured it out. What if I want to avoid appending repeated rows? – marco May 29 '23 at 02:51
  • 1
    I don't think this would be possible without reading the whole previous merged file. – Timeless May 29 '23 at 05:51