-1

I have two dataframes and want to compare them and delete the days in the df2 which are not the same as in df1. I tried to use:

df2[~df2.Date.isin(df1.Date)]

but this does not work and getting an empty dataframe. df2 should look like df1. The dataframe's looks like the following:

df1
        Date
0    20-12-16
1    21-12-16
2    22-12-16
3    23-12-16
4    27-12-16
5    28-12-16
6    29-12-16
7    30-12-16
8    02-01-17
9    03-01-17
10   04-01-17
11   05-01-17
12   06-01-17

df2

         Date
0    20-12-16
1    21-12-16
2    22-12-16
3    23-12-16
4    24-12-16
5    25-12-16
6    26-12-16
7    27-12-16
8    28-12-16
9    29-12-16
10   30-12-16
11   31-12-16
12   01-01-17
13   02-01-17
14   03-01-17
15   04-01-17
16   05-01-17
17   06-01-17
MCM
  • 1,479
  • 2
  • 17
  • 22

2 Answers2

3

It seems dtypes are different. For comparing need same.

Check it by:

print (df1.Date.dtype)
print (df2.Date.dtype)

and then convert if necessary:

df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

I add another 2 solutions - first with numpy.in1d and second with merge, because need default inner join:

 df = df2[np.in1d(df2.Date, df1.Date)]
print (df)
         Date
0  2016-12-20
1  2016-12-21
2  2016-12-22
3  2016-12-23
7  2016-12-27
8  2016-12-28
9  2016-12-29
10 2016-12-30
13 2017-01-02
14 2017-01-03
15 2017-01-04
16 2017-01-05
17 2017-01-06

df = df1.merge(df2, on='Date')
print (df)
         Date
0  2016-12-20
1  2016-12-21
2  2016-12-22
3  2016-12-23
7  2016-12-27
8  2016-12-28
9  2016-12-29
10 2016-12-30
13 2017-01-02
14 2017-01-03
15 2017-01-04
16 2017-01-05
17 2017-01-06

Sample:

d1 = {'Date': ['20-12-16', '21-12-16', '22-12-16', '23-12-16', '27-12-16', '28-12-16', '29-12-16', '30-12-16', '02-01-17', '03-01-17', '04-01-17', '05-01-17', '06-01-17']}
d2 = {'Date': ['20-12-16', '21-12-16', '22-12-16', '23-12-16', '24-12-16', '25-12-16', '26-12-16', '27-12-16', '28-12-16', '29-12-16', '30-12-16', '31-12-16', '01-01-17', '02-01-17', '03-01-17', '04-01-17', '05-01-17', '06-01-17']}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

print (df1.Date.dtype)
object

print (df2.Date.dtype)
object

df1['Date'] = pd.to_datetime(df1['Date'], format='%d-%m-%y')
df2['Date'] = pd.to_datetime(df2['Date'], format='%d-%m-%y')
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • thanks a lot for the answer. when using print I receive an object, convertion helps to convert it into datetime64[ns]. when trying to compare the dataframes i still receive an empty dataframe. Furthermore I want the dates which are equal not the non-equal ones. – MCM Jul 04 '17 at 10:29
  • thanks a lot. it seems to work but the output which I receive should be the same dates as df1 when comparing df1 and df2. Is it clear? I am not sure what Iam doing wrong or misunderstand? – MCM Jul 04 '17 at 10:57
  • @MCM - Problem is with your sample too? What is your pandas version? `print (pd.show_versions())`. – jezrael Jul 04 '17 at 10:59
  • the version I used is pandas: 0.20.2. Why are you sking? – MCM Jul 04 '17 at 11:12
  • Hmmm, I have same. Problem is with my sample data too? – jezrael Jul 04 '17 at 11:13
  • I can reproduce the example above from you. But the output should be as the following: 0 20-12-16 1 21-12-16 2 22-12-16 3 23-12-16 4 27-12-16 5 28-12-16 6 29-12-16 7 30-12-16 8 02-01-17 9 03-01-17 10 04-01-17 11 05-01-17 12 06-01-17 – MCM Jul 04 '17 at 11:15
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/148308/discussion-between-mcm-and-jezrael). – MCM Jul 04 '17 at 11:16
  • I check another answer - do you need filter only same dates which are in both dataframes? – jezrael Jul 04 '17 at 11:21
  • yes. Comparing two dataframes df1 and df2 and delete the non euqual dates from df2 and put that into a new dataframe. Is it clearer? – MCM Jul 04 '17 at 11:22
  • So sorry, need remove `~` and use another answer. – jezrael Jul 04 '17 at 11:23
  • I decide to change my solution to add another 2 with same output as `df2[df2.Date.isin(df1.Date)]` – jezrael Jul 04 '17 at 11:38
  • Thanks this small piece :). It seems to work. The only problem is that dates somehow converted wrongly: Date 0 2016-12-20 1 2016-12-21 2 2016-12-22 3 2016-12-23 7 2016-12-27 8 2016-12-28 9 2016-12-29 10 2016-12-30 13 2017-02-01 14 2017-03-01 15 2017-04-01 16 2017-05-01 17 2017-06-01 look starting from jan, feb etc – MCM Jul 04 '17 at 11:40
  • What does mean wrongly? I thought days and months are swapped, but it seems correct. – jezrael Jul 04 '17 at 11:43
  • yes this is the problem, but why it get swapped? In the other two example it is the same. How can that be prevented? – MCM Jul 04 '17 at 11:44
  • You can specify parameter format `df2['Date'] = pd.to_datetime(df2['Date'], format='%d-%m-%y')` – jezrael Jul 04 '17 at 11:46
  • I t it but seems not to work. I tried it before. Mybe just making something wrong. Can you update the example above? – MCM Jul 04 '17 at 11:49
  • Give me a second. – jezrael Jul 04 '17 at 12:02
  • OK, I do it. And I get also same data as df1, so it means `df1.Date` are subset of `df2.Date`. I think it is correct. – jezrael Jul 04 '17 at 12:05
  • what do you mean? So I receive the following: The code looks like the following: df1 = pd.DataFrame(d1) df2 = pd.DataFrame(d2) df2['Date'] = pd.to_datetime(df2['Date'], format='%d-%m-%y') df = df2[np.in1d(df2.Date, df1.Date)] Output: Date 0 2016-12-20 1 2016-12-21 2 2016-12-22 3 2016-12-23 7 2016-12-27 8 2016-12-28 9 2016-12-29 10 2016-12-30 13 2017-02-01 14 2017-03-01 15 2017-04-01 16 2017-05-01 17 2017-06-01 which the january dates should be: 13 2017-01-02 14 2017-01-03 15 2017-01-04 16 2017-01-05 17 2017-01-06 is it clearer? – MCM Jul 04 '17 at 12:08
  • It seems all is OK. You need filter `df2` and as output dates which are in df1 and too in df2. So your output show you have all data frm df1 also in df2, so output is same as df1. – jezrael Jul 04 '17 at 12:11
  • yes the output is right. But I dont understand why tthe dates get cverted wrongly? – MCM Jul 04 '17 at 12:13
  • Do you think without parameter format? Answer is pandas try guess pattern of datetime and something it happens. – jezrael Jul 04 '17 at 12:14
  • what do you mean? I used: df2['Date'] = pd.to_datetime(df2['Date'], format='%d-%m-%y') and then do the comparison: df = df2[np.in1d(df2.Date, df1.Date)]. Do I miss something? – MCM Jul 04 '17 at 12:16
  • No, I think if not define parameter `format` in `to_datetime` pandas try find pattern of datetime. But now it is all OK. I am really happy it works for you ;) – jezrael Jul 04 '17 at 12:18
  • actually it does not work :(. Also by defining the format. Thats why I am confused – MCM Jul 04 '17 at 12:19
  • Hmmm, what does not work? Do you have different output? – jezrael Jul 04 '17 at 12:19
  • If I compare January I receive the following: Date 13 2017-02-01 14 2017-03-01 15 2017-04-01 16 2017-05-01 17 2017-06-01 But it should be: Date 13 2017-01-02 14 2017-01-03 15 2017-01-04 16 2017-01-05 17 2017-01-06 .Do you understand what I mean? – MCM Jul 04 '17 at 12:22
  • Hmmm, what is `print (df2.Date.dtype)` and `print (df1.Date.dtype)` before converting to datetime? both are objects? – jezrael Jul 04 '17 at 12:26
  • both are datetime64[ns]. The full code looks like the following: d1 = {'Date': pd.to_datetime(['20-12-16', '21-12-16,...])} d2 = {'Date': pd.to_datetime(['20-12-16', '21-12-16', '22-12-16',...])} df1 = pd.DataFrame(d1) df2 = pd.DataFrame(d2) print(df1.Date.dtype) print(df2.Date.dtype) df1['Date'] = pd.to_datetime(df1['Date'], format='%d-%m-%y') df2['Date'] = pd.to_datetime(df2['Date'], format='%d-%m-%y') df = df2[np.in1d(df2.Date, df1.Date)] print(df) – MCM Jul 04 '17 at 12:28
  • I think I see problem – jezrael Jul 04 '17 at 12:30
  • need `d1 = {'Date': ['20-12-16', '21-12-16,...]} d2 = {'Date': ['20-12-16', '21-12-16', '22-12-16',...]}` not `d1 = {'Date': pd.to_datetime(['20-12-16', '21-12-16,...])} d2 = {'Date': pd.to_datetime(['20-12-16', '21-12-16', '22-12-16',...])}`, because you double convert to datetimes one column. – jezrael Jul 04 '17 at 12:31
  • Yes you are right! Now it seems to work. Thanks a lot. – MCM Jul 04 '17 at 12:33
0

Your mistake is from logic. You want to select the df2 date that are df1. So you should write

df2[df2.Date.isin(df1.Date)]

not the contrary of the boolean where comparison/inclusion in df1 is true

You could also obtain the same result with

set(b.Date)-(set(b.Date)-set(a.Date))

Which should then be used through:

pd.DataFrame(sorted((set(b.Date)-(set(b.Date)-set(a.Date)))), columns=["Date"] )   

While the sorting is not optimal and you may change it in pandas by better logic.

 df = pd.DataFrame(list((set(b.Date)-(set(b.Date)-set(a.Date)))), columns=["Date"] ) 
 df.Date = [date.date() for date in df.Date]

or df.Date.dt.date

(see How do I convert dates in a Pandas data frame to a 'date' data type?)

Ando Jurai
  • 1,003
  • 2
  • 14
  • 29
  • Thanks a lot. I don't want the different dates but more comparing two dataframes df1 and df2 and delete the non euqual dates from df2 and put that into a new dataframe. Is it clearer? – MCM Jul 04 '17 at 11:21
  • Yes it is. So that's exactly what my answer do. df2.Date.isin(df1.Date) returns a boolean where True is for dates of df2 similar to df1. So you boolean index these directly into df2. – Ando Jurai Jul 04 '17 at 11:29
  • So the answer above helped. Thanks for the help. – MCM Jul 04 '17 at 12:33