0

So i'm new to python, I want to discover it's potential and get more knowledge about what I can do with it. I did this code to compare CSVs, basically what it does you provide it 2 CSVs, CSV1 that has some id column and a column with values that you want to add to other CSV (CSV2)

Note: This script does exactly what I want, and it seems to work fine, hope it can also be useful to someone, my question is really what I can do to improve it's performance or even make the code cleaner

# Made by Varqas
# CSV1 = CSV containing values that can be matched in CSV2 and a column that will be added
# CSV2 = CSV containing values that can be matched and column that will be concatenated at the end of the CSV (The last column values should be empty)

with open('csv1.csv', encoding="utf8") as check_file:
    # Get Column that will be used to Compare values and add it to a list
    columnToCompare = list([row.split(',')[0].strip() for row in check_file])

with open('csv1.csv', encoding="utf8") as check_file:
    # Get Column that will be used to add to a row values and add it to a list
    columnToAdd = list([row.split(',')[2].strip() for row in check_file])

with open('csv2.csv', 'r', encoding="utf8") as in_file, open('out.csv', 'w', encoding="utf8") as out_file:
    i = 0
    # For each Row in CSV2
    for line in in_file:
        # Write Headers
        if i == 0:
            out_file.write(line)
        else:
            # GET Column on CSV2 containing value that will be compared on CVS1
            value = line.split(',')[1].strip()
            # Check if first Column value on CSV2  either variable is in 
            if value in columnToCompare:
                # Check for duplicates in the list 
                numberOfOccurences = list(columnToCompare).count(value)
                concatRow = ""
                if numberOfOccurences > 1:
                    # Concatenate all values of occurences
                    for x in range(numberOfOccurences):
                        index = list(columnToCompare).index(value)
                        concatRow = concatRow + columnToAdd[index]
                        if x != numberOfOccurences - 1:
                            concatRow = concatRow + " + "
                        # Remove value so list.index doesn't found same row
                        columnToCompare[index] = ""
                else:
                    # Add other row that doesn't match
                    index = list(columnToCompare).index(value)
                    concatRow = columnToAdd[index]

                # Concat to last column of CSV2
                out_file.write(line.strip() + concatRow + "\n")
            else:
                # Still concat value in CSV2 to last column if not found in csv1 
                out_file.write(line.strip() + "not found" + "\n")
        i = i + 1

I know it can be improved, and minified perhaps using some libs... Let me know what you think!

I tried using pd merge, but I didn't understand quite well how I could add concatenation and values inside it.

Varqas
  • 17
  • 4

1 Answers1

1

You can use the Pandas library to read the two CSV files into dataframes and merge the two columns into the second CSV and output a new CSV with the merged columns.

import pandas as pd

# read first CSV
df1 = pd.read_csv('first.csv')

# read second CSV
df2 = pd.read_csv('second.csv')

# merge the id column and a "column with values 
# that you want to add to other CSV (CSV2)"
# for the example the second column is named 'data'.
merged_df = pd.merge(df2, df1[['id', 'data']], on='id', how='left')

# save new dataframe to csv.
merged_df.to_csv('merged.csv', index=False)
Captain Caveman
  • 1,448
  • 1
  • 11
  • 24
  • The thing is, that on CSV2 there is also a Column named 'data' column, therefore when I tested this now, it created 2 new Columns (data_x, data_y) EDIT: Nevermind, they had values in it, I guess this works, although doesn't fill other rows with "not found" and it's not accepting utf-8 – Varqas May 12 '23 at 13:44
  • You can save-as your files in utf-8 encoding or try to pass the encoding='utf-8' parameter when you read the CSVs into dataframes. You can insert 'not found' like this: df['column_name'] = df['column_name'].fillna('not found'). – Captain Caveman May 12 '23 at 14:03