1

this is in reference to the issue I raised here: compare two pandas dataframes with unequal columns

also in reference: How to implement 'in' and 'not in' for Pandas dataframe

I have created two pandas data frames :

DataFrame: words

                  0
0           limited
1         desirable
2           advices

DataFrame: mcDonaldWL

            Word       Negative   Positive  Uncertainty
9            abandon     2009        0           0
10         abandoned     2009        0           0
11        desirables        0        2009        0
12       abandonment     2009        0           0
13           advices     2009        0           0
14          abandons     2009        0           0

My goal here is to compare words[0] to mcDonaldWL['Word'] and if the ith element occurs, present the results.

Result 
              Word       Negative   Positive  Uncertainty
 11        desirables        0       2009        0
 13           advices     2009        0           0

I tried to use set, intersection, merge, but can't find a solution. Any thoughts?

It does not generate the required answer. This is not duplicate.

If I run

words[~words.word.isin(mcDonaldWL)] 

I get:

    word
0   limited
1   desirable
lpt
  • 931
  • 16
  • 35

4 Answers4

1

Suppose you have:

>>> df1
         col1
0     limited
1  desirables
2     advices
>>> df2
           Word  Negative  Positive  Uncertainty
9       abandon      2009         0            0
10    abandoned      2009         0            0
11   desirables         0      2009            0
12  abandonment      2009         0            0
13      advices      2009         0            0
14     abandons      2009         0            0

Note, I've given your first data-frame a proper column label. Anyway, the simplest thing to do is to use Word as an index:

>>> df2.set_index('Word', inplace=True)
>>> df2
             Negative  Positive  Uncertainty
Word
abandon          2009         0            0
abandoned        2009         0            0
desirables          0      2009            0
abandonment      2009         0            0
advices          2009         0            0
abandons         2009         0            0

Then you can use indexing!

>>> df2.loc[df1.col1.values]
            Negative  Positive  Uncertainty
Word
limited          NaN       NaN          NaN
desirables       0.0    2009.0          0.0
advices       2009.0       0.0          0.0
>>> df2.loc[df1.col1.values].dropna()
            Negative  Positive  Uncertainty
Word
desirables       0.0    2009.0          0.0
advices       2009.0       0.0          0.0
>>>
juanpa.arrivillaga
  • 88,713
  • 10
  • 131
  • 172
1

By using fuzzy match

from fuzzywuzzy import process
l=words.iloc[:,0].values.tolist()

a=[]
for x in mcDonaldWL.Word:
    if [process.extract(x, l, limit=1)][0][0][1]>=80:
        a.append([process.extract(x, l, limit=1)][0][0][0])
    else:
        a.append(np.nan)

mcDonaldWL['canfind']=a
mcDonaldWL.dropna().drop('canfind',1)


Out[494]: 
          Word  Negative  Positive  Uncertainty
11  desirables         0      2009            0
13     advices      2009         0            0
BENY
  • 317,841
  • 20
  • 164
  • 234
1

Approach 1

ws = words.values.ravel().astype(str)
wl = mcDonaldWL.Word.values.astype(str)

mcDonaldWL[(np.core.defchararray.find(wl[:, None], ws) >= 0).any(1)]

          Word  Negative  Positive  Uncertainty
11  desirables         0      2009            0
13     advices      2009         0            0

Approach 2

mcDonaldWL[mcDonaldWL.Word.str.contains('|'.join(words.values.ravel()))]

          Word  Negative  Positive  Uncertainty
11  desirables         0      2009            0
13     advices      2009         0            0
piRSquared
  • 285,575
  • 57
  • 475
  • 624
0

In words, you have "desirable", but in mcDonaldWL, you have "desirables". Assuming these are supposed to be the same, you can do:

mcDonaldWL.set_index('Word', inplace=True)
mcDonaldWL.loc[words[0]]

Also, "advices" is not a word.

Acccumulation
  • 3,491
  • 1
  • 8
  • 12