comparing the two data frames

Question

this is in reference to the issue I raised here: compare two pandas dataframes with unequal columns

also in reference: How to implement 'in' and 'not in' for Pandas dataframe

I have created two pandas data frames :

DataFrame: words

                  0
0           limited
1         desirable
2           advices

DataFrame: mcDonaldWL

            Word       Negative   Positive  Uncertainty
9            abandon     2009        0           0
10         abandoned     2009        0           0
11        desirables        0        2009        0
12       abandonment     2009        0           0
13           advices     2009        0           0
14          abandons     2009        0           0

My goal here is to compare words[0] to mcDonaldWL['Word'] and if the ith element occurs, present the results.

Result 
              Word       Negative   Positive  Uncertainty
 11        desirables        0       2009        0
 13           advices     2009        0           0

I tried to use set, intersection, merge, but can't find a solution. Any thoughts?

It does not generate the required answer. This is not duplicate.

If I run

words[~words.word.isin(mcDonaldWL)]

I get:

    word
0   limited
1   desirable

Are you looking for *exact* matches or not? – juanpa.arrivillaga Oct 31 '17 at 23:33 — juanpa.arrivillaga, Oct 31 '17 at 23:33

score 1 · Accepted Answer · answered Oct 31 '17 at 23:22

Suppose you have:

>>> df1
         col1
0     limited
1  desirables
2     advices
>>> df2
           Word  Negative  Positive  Uncertainty
9       abandon      2009         0            0
10    abandoned      2009         0            0
11   desirables         0      2009            0
12  abandonment      2009         0            0
13      advices      2009         0            0
14     abandons      2009         0            0

Note, I've given your first data-frame a proper column label. Anyway, the simplest thing to do is to use Word as an index:

>>> df2.set_index('Word', inplace=True)
>>> df2
             Negative  Positive  Uncertainty
Word
abandon          2009         0            0
abandoned        2009         0            0
desirables          0      2009            0
abandonment      2009         0            0
advices          2009         0            0
abandons         2009         0            0

Then you can use indexing!

>>> df2.loc[df1.col1.values]
            Negative  Positive  Uncertainty
Word
limited          NaN       NaN          NaN
desirables       0.0    2009.0          0.0
advices       2009.0       0.0          0.0
>>> df2.loc[df1.col1.values].dropna()
            Negative  Positive  Uncertainty
Word
desirables       0.0    2009.0          0.0
advices       2009.0       0.0          0.0
>>>

score 1 · Answer 2 · answered Oct 31 '17 at 23:28

1

By using fuzzy match

from fuzzywuzzy import process
l=words.iloc[:,0].values.tolist()

a=[]
for x in mcDonaldWL.Word:
    if [process.extract(x, l, limit=1)][0][0][1]>=80:
        a.append([process.extract(x, l, limit=1)][0][0][0])
    else:
        a.append(np.nan)

mcDonaldWL['canfind']=a
mcDonaldWL.dropna().drop('canfind',1)


Out[494]: 
          Word  Negative  Positive  Uncertainty
11  desirables         0      2009            0
13     advices      2009         0            0

answered Oct 31 '17 at 23:28

BENY

317,841
20
164
234

Thanks Wen. This probably could be used mostly for NLP than a strict dataframe comparison. – lpt Nov 01 '17 at 22:41
@lpt yes you are right cause your show case so i choose this one – BENY Nov 01 '17 at 22:43

score 1 · Answer 3 · answered Oct 31 '17 at 23:48

Approach 1

ws = words.values.ravel().astype(str)
wl = mcDonaldWL.Word.values.astype(str)

mcDonaldWL[(np.core.defchararray.find(wl[:, None], ws) >= 0).any(1)]

          Word  Negative  Positive  Uncertainty
11  desirables         0      2009            0
13     advices      2009         0            0

Approach 2

mcDonaldWL[mcDonaldWL.Word.str.contains('|'.join(words.values.ravel()))]

          Word  Negative  Positive  Uncertainty
11  desirables         0      2009            0
13     advices      2009         0            0

Acccumulation · Answer 4 · 2017-10-31T23:33:30.473

0

In words, you have "desirable", but in mcDonaldWL, you have "desirables". Assuming these are supposed to be the same, you can do:

mcDonaldWL.set_index('Word', inplace=True)
mcDonaldWL.loc[words[0]]

Also, "advices" is not a word.

edited Oct 31 '17 at 23:33

answered Oct 31 '17 at 23:28

Acccumulation

3,491
1
8
12

comparing the two data frames

4 Answers4