0

Good morning! I am trying to remove duplicate rows from a csv file with panda. I have 2 files, A.csv and B.csv I want to delete all rows in A that exist in B.

File A.csv:

Pedro,10,rojo
Mirta,15,azul 
Jose,5,violeta

File B.csv:

Pedro,
ignacio,
fernando,
federico,

Output file output.csv:

Mirta,15,azul 
Jose,5,violeta

try to join the files and then apply

cat A.csv B.csv > output.csv

and run this program in python:

import pandas as pd

df = pd.read_csv('output.csv')
df.drop_duplicates(inplace=True)
df.to_csv('final.csv', index=False)
  • Possible duplicate of [https://stackoverflow.com/questions/50103918/how-to-compare-two-pandas-dataframes-and-remove-duplicates-on-one-file-without-a](https://stackoverflow.com/questions/50103918/how-to-compare-two-pandas-dataframes-and-remove-duplicates-on-one-file-without-a) –  Jan 30 '20 at 14:05

2 Answers2

0
  1. Import both A.csv and B.csv that you have df_a and df_b.

  2. Then you append both dataframe that you have df_ab:

    df_ab = df_a.append(df_b)

  3. then you look for the duplicated rows:

    df_ab_only_duplicated_values = df_ab[df_ab.duplicated()]

  4. You append df_a with df_ab_only_duplicated_values and drop all duplicated rows:

    df_a = df_a.append(df_ab_only_duplicated_values).drop_duplicates(keep=False)

Community
  • 1
  • 1
Vinz
  • 121
  • 2
  • 7
0
data_A = {'name': ['Pedro','Mirta','Jose'], 'age': [10,15,5]}
data_B = {'name': ['Pedro','ignacio','fernando', 'federico']}

A = pd.DataFrame(data=data_A)
B = pd.DataFrame(data=data_B)
A

    name    age
0   Pedro   10
1   Mirta   15
2   Jose    5

B

    name
0   Pedro
1   ignacio
2   fernando
3   federico

df = pd.merge(A, B, on='name', how='left', indicator=True)
df
    name    age _merge
0   Pedro   10  both
1   Mirta   15  left_only
2   Jose    5   left_only
df  = df[df['_merge'] != 'both']
df

    name    age _merge
1   Mirta   15  left_only
2   Jose    5   left_only
df.drop(columns='_merge', inplace=True)
df
    name    age
1   Mirta   15
2   Jose    5
alanindublin
  • 101
  • 6