0

I have multiple CSV files which I want to concat. The files may or may not have the same columns. If columns are not the same, the data in different columns should be empty.

Example csv file1 has this data.

name, age
abc,  21
dsd,  20

csv file2 has this data

height, pay
5'5,    2000
5'10,   1000

Now I want to merge these two. For Example, they have millions of rows and I can't read them entirely. I want output in this style

name, age, height, pay
abc,  21,  None,  None
dsd, 20,   None,  None
   ,       5'5,   2000
   ,       5'10,  1000

Note that columns have null values in them if they don't have data. I have this code but it isn't working. Appending column names with each chunk and not adding empty columns.

import pandas as pd
def common_member(a, b):
    a_set = set(a)
    b_set = set(b)
    if (a_set & b_set):
        return True 
    else:
        return False
all_files = ["/home/shahid/work-folders/Ai-Studio-input/mixed_csv/ludwig_train.csv", "/home/shahid/work-folders/Ai-Studio-input/mixed_csv/ludwig_train (copy).csv"]
merged_columns = []
for file in all_files:
    df = pd.read_csv(open(file), float_precision='round_trip', chunksize=1000)
    for chunk in df:
        stop = True
        if stop ==True:
            merged_columns+= list(chunk.columns)
            break
        stop= False
print(set(merged_columns))
for file in all_files:
    df = pd.read_csv(open(file), float_precision='round_trip', chunksize=1000)
    for chunk in df:
        if list(chunk.columns) == set(merged_columns):
            chunk.to_csv("/home/shahid/work-folders/Ai-Studio-input/mixed_csv/merged.csv", mode='a', index=False)
Mark Setchell
  • 191,897
  • 31
  • 273
  • 432
shahid hamdam
  • 751
  • 1
  • 10
  • 24
  • Just do: `df = pd.concat([df1, df2])`. – Mayank Porwal Apr 07 '21 at 06:10
  • First I need to read both files in memory which Is not possible for me. I want to do it chunk by chunk. Second there can be 50 csv files or 100. Not a possible solution. – shahid hamdam Apr 07 '21 at 06:11
  • @MayankPorwal If I could read all files in memory, then I wouldn't be asking questions here. – shahid hamdam Apr 07 '21 at 06:13
  • If the data won't fit in memory, what are you planning to do with the combined data? Flush it to disk and start over? If so, then the `concat` solution certainly will work for you. – Tim Roberts Apr 07 '21 at 06:16
  • You do know you can concatenate CSV files one line at a time without involving pandas at all, right? They're just strings in text files. – Tim Roberts Apr 07 '21 at 06:19
  • @TimRoberts Answer to your first comment, I want to write csv file using df.to_csv() in append mode, it will write file on disk and not memory, the data will not actully get combined at any point before written to csv file. to answer your 2nd question, yes I can write csv file line by line but how I can handle missing data due to different columns in all files? – shahid hamdam Apr 07 '21 at 06:34

0 Answers0