
Help and advice needed from the experts, please.

I have hundreds of CSV files which I would like to merge, but their columns aren't the same. For example:

  • File 1: Header1, Header2, Header3
  • File 2: Header1, Header2
  • File 3: Header3, Header4
  • File 4: Header1, Header3, Header4

I would like to merge the data from all of these CSV files.

I thought of some logic, but I am struggling to implement it in Python code. I think it requires a DataFrame with pre-defined headers (Header1, Header2, Header3, Header4). Then I should loop through each CSV file and search for each header: if the header exists, its data is appended to the DataFrame; otherwise it is skipped.

Can someone please advise if there is a function in Python that can make this simple? I have tried to use read_csv, but the data structure has to be the same when looping through the CSV files.
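In rough form, the logic I have in mind looks something like this (a sketch only, using in-memory stand-ins for the real CSV files; `reindex` forces every frame onto the same pre-defined headers):

```python
import io
import pandas as pd

# In-memory stand-ins for the real CSV files (illustration only)
files = [
    io.StringIO("Header1,Header2,Header3\na,b,c\n"),
    io.StringIO("Header1,Header2\nd,e\n"),
]

# Pre-defined headers for the combined DataFrame
all_headers = ["Header1", "Header2", "Header3", "Header4"]

frames = []
for f in files:
    df = pd.read_csv(f)
    # reindex keeps the columns that exist and fills missing ones with NaN
    frames.append(df.reindex(columns=all_headers))

combined = pd.concat(frames, ignore_index=True)
print(combined)
```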

import pandas as pd
import os
import glob

# Location of CSV Files
data_location='C:\\Tableau Reports\\ST Database\\Files\\Downloads\\CSV Files\\*.csv'
# This will give path location to all CSV files
csv_files=glob.glob(data_location)

for csv_file in csv_files:
    # Encoding to be changed (UTF-8).
    with open(csv_file, newline='', encoding='cp1252') as csvfile:
        df_file = pd.read_csv(csvfile)

Combined_df = pd.concat(df_file)

print(Combined_df)

I tried following the advice given in this forum but am getting an error at the line df_file = pd.read_csv(csvfile):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 114934: character maps to &lt;undefined&gt;

Any help or advice is greatly appreciated. Thank you.

1 Answer


EDIT

The code you posted seems overly complex. You don't have to use open with pd.read_csv; you can pass a file name in directly as a string:

for csv_file in csv_files:
    # Encoding to be changed (UTF-8).
    with open(csv_file, newline='', encoding='cp1252') as csvfile:
        df_file = pd.read_csv(csvfile)
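Passing the path straight in looks like this (a throwaway file is written first so the snippet runs on its own; `cp1252` just mirrors the encoding used in the original code):

```python
import pandas as pd

# Write a tiny throwaway CSV so the example is self-contained
with open("example.csv", "w", encoding="cp1252") as f:
    f.write("Header1,Header2\n1,2\n")

# No open() needed: pd.read_csv takes the path and encoding directly
df_file = pd.read_csv("example.csv", encoding="cp1252")
```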

The following should work, using a generator expression passed to pd.concat:

df = pd.concat(pd.read_csv(csv_file) for csv_file in csv_files)

If you're not a fan of that style, you can instead start with an empty DataFrame and keep concatenating your small DataFrames onto it:

large_df = pd.DataFrame()
for csv_file in csv_files:
    small_df = pd.read_csv(csv_file)
    large_df = pd.concat((large_df, small_df))
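One caveat with that pattern: each pd.concat inside the loop copies the whole of `large_df` again, which gets slow over hundreds of files. A common variant is to collect the small frames in a list and concatenate once at the end (sketched here with in-memory stand-ins for the files):

```python
import io
import pandas as pd

# In-memory stand-ins for the real CSV files (illustration only)
files = [
    io.StringIO("Header1,Header2\n1,2\n"),
    io.StringIO("Header2,Header3\n3,4\n"),
]

# Collect the small frames in a plain Python list...
small_dfs = [pd.read_csv(f) for f in files]

# ...and concatenate once at the end
large_df = pd.concat(small_dfs, ignore_index=True)
```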

ORIGINAL

pd.concat can already do this for you! I've made example data; f1_r1 means "file 1, row 1", just to show that it's working as expected:

import pandas as pd
import io

#The io.StringIO makes these "pretend" .csv files for illustration
#All of this is just to create the example data, you won't do this
file1_name = io.StringIO("""
Header1,Header2,Header3
f1_r1,f1_r1,f1_r1
f1_r2,f1_r2,f1_r2
""")

file2_name = io.StringIO("""
Header1,Header2
f2_r1,f2_r1
f2_r2,f2_r2
""")

file3_name = io.StringIO("""
Header3,Header4
f3_r1,f3_r1
f3_r2,f3_r2
""")

file4_name = io.StringIO("""
Header1,Header3,Header4
f4_r1,f4_r1,f4_r1
f4_r2,f4_r2,f4_r2
""")

#this would be a list of your .csv file names
files = [file1_name, file2_name, file3_name, file4_name]

#pd.concat already handles this use-case for you!
combined_df = pd.concat(pd.read_csv(f_name,encoding='utf8') for f_name in files)
print(combined_df)

The printed combined_df is one table with the union of all four headers, with NaN wherever a file lacked a column:

      Header1 Header2 Header3 Header4
    0   f1_r1   f1_r1   f1_r1     NaN
    1   f1_r2   f1_r2   f1_r2     NaN
    0   f2_r1   f2_r1     NaN     NaN
    1   f2_r2   f2_r2     NaN     NaN
    0     NaN     NaN   f3_r1   f3_r1
    1     NaN     NaN   f3_r2   f3_r2
    0   f4_r1     NaN   f4_r1   f4_r1
    1   f4_r2     NaN   f4_r2   f4_r2

mitoRibo
  • Thank you mate, definitely a good starting point for me. Let me give it a go and I will give you feedback, much appreciated – Muhammad Adeel Ahmed Aug 12 '22 at 01:51
  • Code added in the main question. I tried to follow your guidelines but am getting an error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 114934: character maps to &lt;undefined&gt; – Muhammad Adeel Ahmed Aug 12 '22 at 02:10
  • OK, I've edited the code with `encoding='utf8'`. This error is unrelated to making one big table out of your smaller tables; instead it's to do with your CSV file encoding. If your files aren't utf8 then you can try different encodings and read the discussion here: https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character – mitoRibo Aug 12 '22 at 02:24
  • Oh, I see in your code you have `# Encoding to be changed (UTF-8)`. Actually I see you have a lot going on in your code; I'll edit my answer – mitoRibo Aug 12 '22 at 02:27
  • Great, got it working; the encoding is cp437. I see it is working fine, but I don't understand what I was doing wrong in my code (see above). I got the file names (CSV files) from the location and created an array of them. Then I looped through each file and read the CSV. After each iteration of the for loop, I concatenated the DataFrame, but that was giving me an error. Your code does it in a similar way but as a one-liner (much more optimised code). – Muhammad Adeel Ahmed Aug 12 '22 at 02:33
  • One issue with your code is that you are looping through `csv_files` and reading one of the small tables into `df_file` in each iteration of the loop, but you aren't saving these small frames. Each iteration, `df_file` gets overwritten and lost. If you take a look at my code that starts with `large_df = pd.DataFrame()`, you can see how to add the `small_df` to the `large_df` in each loop – mitoRibo Aug 12 '22 at 02:37
  • I will now also try the large & small DataFrame logic; this is much clearer to me and very well explained in your last post. I think I got your point, something I learnt, many thanks mate – Muhammad Adeel Ahmed Aug 12 '22 at 02:38
  • Here's a helpful question related to your situation: https://stackoverflow.com/questions/28669482/appending-pandas-dataframes-generated-in-a-for-loop – mitoRibo Aug 12 '22 at 02:39
  • Many thanks, got it working with the large_df & small_df code too. Great, completely understood, thanks mate – Muhammad Adeel Ahmed Aug 12 '22 at 02:45