Pandas to csv method in append mode with dataframes with different columns

Question

I have a very large number of csv files that I need to merge into a single one. I can't make a list and concatenate later because of memory restrictions even tho I have 64 GB of RAM.

To avoid saving everything into memory, I'm streaming the data into a file using:

entidad_csv = folder_entidad / f"{entidad.name.lower()}.csv"
for f in tqdm(files):
    df = descomprime(f)
    df.to_csv(entidad_csv, index=False, mode="a", header=not entidad_csv.exists())

But I'm having problems when a column is missing in one of the files because it appends the rows as is. I don't know before hand which columns are present in each file so I need the merged file to have every column.

Thanks in advance

Do all CSV files have a header row that matches the columns in each file? — ken, May 27 '23 at 04:36
Yes, every csv has a column header, a set of columns is the same for all files. — marco, May 27 '23 at 05:59
I'm confused. What does "a set of columns is the same for all files" mean? You said "a column is missing in one of the files", what about the header of that file? — ken, May 27 '23 at 09:39

score 1 · Answer 1 · answered May 27 '23 at 04:38

1

You can try making a list of columns based on files read and make it a blank dataframe with header only and then use the blank dataframe to concat with new csv file. It will make a new dataframe with all columns from files.

answered May 27 '23 at 04:38

Dinh Quang Tuan

468
3
10

Timeless · Accepted Answer · 2023-05-27T06:32:10.160

IIUC, you can use the logic below that uses :

Index.union : to form a unique combination between the headers
insert_line_front : to replace/overwrite the header at each iteration
reindex : to produce NaN values below the missing col(s)

output_file = Path("output_folder") / "merged.csv"

for file_path in tqdm(Path("input_files").glob("*.csv")):
    df = pd.read_csv(file_path)
    if not output_file.exists():
        df.to_csv(output_file, index=False) # header=True
    else:
        with open(output_file, "r") as file:
            old_header = file.readline().strip()
        if old_header:
            new_header = pd.Index(old_header.split(",")).union(df.columns, sort=False)
            #https://stackoverflow.com/a/48734140/16120011 © Stephen Rauch
            insert_line_front(output_file, new_header.str.cat(sep=","))
            df.reindex(new_header, axis=1).to_csv(
                output_file, index=False, header=False, mode="a"
            )

Test/Output :

file1.csv:

col1
foo
bar
qux

file2.csv:

col3,col4
uu,3
ww,4

file3.csv:

col5,col2,col1
ii,5,baz
jj,6,qux

merged.csv (in tabular format) :

col1	col3	col4	col5	col2
foo
bar
qux
	uu	3.0
	ww	4.0
baz			ii	5.0
qux			jj	6.0

Tree view :

input_files
┣━━ file1.csv
┣━━ file1.csv
┗━━ file3.csv
output_folder
┗━━ merged.csv

Nice! logic is sound and this might be a solution but its giving me a UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4935: character maps to that Im trying to figure out — marco, May 27 '23 at 06:47
In which line the error is thrown ? Anyways, you might play with the encodings to get the right one, check [this](https://stackoverflow.com/a/65596858/16120011) answer that addresses a similar error.. — Timeless, May 27 '23 at 07:02
I figured it out. What if I want to avoid appending repeated rows? — marco, May 29 '23 at 02:51
I don't think this would be possible without reading the whole previous merged file. — Timeless, May 29 '23 at 05:51

Pandas to csv method in append mode with dataframes with different columns

2 Answers2