
I have two CSV files and I am trying to combine them into one based on a condition.

They both have an 'Import_ID' column which gives a unique ID for each row. What I need to do is check whether an Import_ID from df1 already exists in df2; if not, add that row, otherwise skip it. Ideally this would be done without loops or iterrows().

df1
 Import_ID            State                  ProductName
0     7027514279_100    716        TRANSCEND ESD220C 240GB SSD extern
1   7027514279_200.1    716        INTELLINET 48,3cm 19Zoll Steckdosenleist
2   7027514279_200.2    716        INTELLINET 48,3cm 19Zoll Steckdosenleist
3   7027514279_200.3    716        INTELLINET 48,3cm 19Zoll Steckdosenleist
4   7027514279_200.4    716        INTELLINET 48,3cm 19Zoll Steckdosenleist

and df2
   Import_ID          State                  ProductName
0     7027329459_100    716        BROTHER MFCJ6530DW MULTI-FUNCTION (AT)
1   7027514279_200.2    716        INTELLINET 48,3cm 19Zoll Steckdosenleist
2   7027514279_200.3    716        INTELLINET 48,3cm 19Zoll Steckdosenleist
3     7027329459_400    716        BROTHER LC-3219XLY Tinte Gelb (1500 S)
4     7027329459_500  20010        BROTHER LC-3219XLBK Tinte Schwarz

What I would like is for df1 rows to be imported only when their Import_ID doesn't exist in df2.

df2 updated:
 Import_ID          State                  ProductName
0     7027329459_100    716        BROTHER MFCJ6530DW MULTI-FUNCTION (AT)
1   7027514279_200.2    716        INTELLINET 48,3cm 19Zoll Steckdosenleist
2   7027514279_200.3    716        INTELLINET 48,3cm 19Zoll Steckdosenleist
3     7027329459_400    716        BROTHER LC-3219XLY Tinte Gelb (1500 S)
4     7027329459_500  20010        BROTHER LC-3219XLBK Tinte Schwarz
5     7027514279_100    716        TRANSCEND ESD220C 240GB SSD extern
6   7027514279_200.1    716        INTELLINET 48,3cm 19Zoll Steckdosenleist
7   7027514279_200.4    716        INTELLINET 48,3cm 19Zoll Steckdosenleist

So in the example, df1 rows 1 and 2 are not copied, because they already exist.

I tried creating a temporary frame where I would store all of the rows from df1 that don't exist in df2, but it didn't work:

temp = numpy.where(file_in['Import_ID'] not in finalData['Import_ID'], file_in)

I don't know what else I should try, since I am pretty new to Python.
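
(Note: `not in` applied to a pandas Series tests membership of the whole object rather than checking each element, so numpy.where never receives the boolean mask it would need. The element-wise membership test in pandas is Series.isin; a minimal sketch of that route, assuming the frames are already loaded as df1 and df2 as shown above:)

import pandas as pd

# True for df1 rows whose Import_ID is not yet present in df2
mask = ~df1['Import_ID'].isin(df2['Import_ID'])

# append only those rows to df2
df2_updated = pd.concat([df2, df1[mask]], ignore_index=True)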

miwa_p
  • concat both then drop duplicates - `pd.concat([df2,df1]).drop_duplicates().reset_index(drop=True)` ? – anky Aug 22 '19 at 14:22
  • @anky_91 if df2 is all unique rows, that is fine; if it is not, we will drop partial information from df2 – BENY Aug 22 '19 at 14:28

1 Answer


Using merge with indicator=True (see the pandas documentation on merge for more info):

s = df1.merge(df2, how='left', indicator=True)
pd.concat([df2, s.loc[s['_merge'] == 'left_only'].drop('_merge', axis=1)])
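
For context, a slightly fuller sketch of the same approach (the names new_rows and df2_updated are just illustrative; note that merge with no `on=` joins on all shared columns, so this assumes duplicate rows match on every column, as in the sample data):

import pandas as pd

# left-merge df1 against df2; indicator=True adds a '_merge' column that is
# 'both' when a matching row exists in df2 and 'left_only' when it does not
s = df1.merge(df2, how='left', indicator=True)

# keep only the unmatched df1 rows and append them to df2
new_rows = s.loc[s['_merge'] == 'left_only'].drop('_merge', axis=1)
df2_updated = pd.concat([df2, new_rows], ignore_index=True)
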
BENY
  • when I try your solution it gives me this error: ValueError: You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat – miwa_p Aug 22 '19 at 14:37
  • @AmilaBečirović Please check your df dtypes, to see that columns with the same name have the same datatype – BENY Aug 22 '19 at 14:39
  • strange, a few of them don't have the same dtype, although the other data frame was created in the same way as the first one :/ – miwa_p Aug 22 '19 at 14:51
  • @AmilaBečirović you need to convert them to the same dtype :-) – BENY Aug 22 '19 at 14:53
  • I changed the dtype, but now it only shows the rows from df1, and the rows from df2 that differ are nowhere to be seen – miwa_p Aug 22 '19 at 15:28
  • I did it like this and it apparently works: `s = pd.concat([df1,df2], ignore_index=True); s = s.sort_values('Import_ID', ascending=False); s = s.drop_duplicates(subset='Import_ID', keep='first')` – miwa_p Aug 22 '19 at 16:00
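
Pulling the comment thread together, a standalone sketch of that last approach; the file names are placeholders, and Import_ID is forced to str on read so the float64-vs-object mismatch mentioned above cannot occur:

import pandas as pd

# placeholder file names; force Import_ID to a common string dtype
df1 = pd.read_csv('df1.csv', dtype={'Import_ID': str})
df2 = pd.read_csv('df2.csv', dtype={'Import_ID': str})

# df2 goes first so keep='first' preserves the existing df2 rows,
# and df1 rows with an already-seen Import_ID are dropped
combined = pd.concat([df2, df1], ignore_index=True)
combined = combined.drop_duplicates(subset='Import_ID', keep='first')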