
I am trying to get familiar with Python and would like to ask for a little help with the following task.

I have imported two data frames, dfA and dfB, from Excel with pandas. I would like to count how many times each row of dfA occurs in dfB. To do this, I ran dfSearch = dfA['Title'].tolist() to get a list of the values to search for.

My approach is the following:

for i in dfSearch:
    result = dfB['COL1'].count(i)

Then I would like to add a new column to dfA that stores the result for each row.

    dfA['FIND_VAL1'] = result

I am sorry if this task seems trivial, but I am completely new to Python and really need some help.

Data example A:

title 
plane 
house 
car

Data example B:

title 
aero plane 
household 
luxury cars 
house decorations

Result example:

title   Results    
plane     1     
house     2    
car       1
cs95
simpleMan
  • Have you tried inner join in pandas on those columns? – Kush Patel Aug 31 '17 at 14:47
  • Please provide samples of dfA and dfB along with expected output. See this article [how to ask questions](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Scott Boston Aug 31 '17 at 14:47

3 Answers


You could call str.count in a list comprehension.

dfA['Results'] = [dfB.title.str.count(x).sum() for x in dfA.title]
dfA

   title  Results
0  plane        1
1  house        2
2    car        1
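One caveat worth spelling out: Series.str.count treats its argument as a regular expression, so search terms containing characters such as ".", "+" or "(" can miscount or raise errors like the "multiple repeat" one reported in the comments below. A minimal sketch of a workaround, assuming the titles should be matched literally; the sample data here is made up for illustration and is not from the question:

```python
import re
import pandas as pd

# Hypothetical data: the last search term contains a regex metacharacter (".")
dfA = pd.DataFrame({"title": ["plane", "house", "1.5"]})
dfB = pd.DataFrame({"title": ["aero plane", "household", "house decorations",
                              "125 cars", "1.5 litre"]})

# Unescaped, "1.5" is a regex where "." matches any character, so "125" also counts
unescaped = [dfB["title"].str.count(x).sum() for x in dfA["title"]]

# re.escape makes every character literal, so only the exact substring is counted
dfA["Results"] = [dfB["title"].str.count(re.escape(x)).sum() for x in dfA["title"]]
```

With this data the unescaped version counts "1.5" twice, while the escaped version counts it once.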

An alternative list comprehension using pure Python sum and str.count, as suggested by piRSquared:

dfA = dfA.assign(Results=[sum([x.count(y) for x in dfB.title.values.tolist()]) 
                                       for y in dfA.title.values.tolist()])
dfA
   title  Results
0  plane        1
1  house        2
2    car        1

This one seems faster for small data, but may not scale as well.

cs95
  • If you're going to use comprehension, don't stop half-way `dfA.assign(Results=[sum([x.count(y) for x in dfB.title.values.tolist()]) for y in dfA.title.values.tolist()])` This is quicker. – piRSquared Aug 31 '17 at 16:17
  • @piRSquared Are you sure it scales for larger data? – cs95 Aug 31 '17 at 16:23
  • Haven't done that test yet. But your solution's time complexity is the same. Both scale quadratically. As does mine! I don't know a way around that. `O(nxm)` But the comprehension is quicker over small data than `str.count().sum()`. – piRSquared Aug 31 '17 at 16:25
  • @piRSquared The annoying thing is you'd have to look for substring matches too. Otherwise you could've used a `collections.Counter` and done this in linear time. – cs95 Aug 31 '17 at 16:30
  • @COLDSPEED I have tested your solution and I ran into an error: multiple repeat at position 23. Why did I get an error? – simpleMan Sep 01 '17 at 09:25
  • @simpleMan I don't know. I can't tell from the one sentence you've given me. Update your pandas. – cs95 Sep 02 '17 at 05:16

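Use the vectorized count function from numpy.char (the public name for numpy.core.defchararray, whose numpy.core path is deprecated in NumPy 2.x) with some numpy broadcasting magic.

import numpy as np

# Cast both columns to NumPy string arrays; a becomes a column vector of shape (n, 1)
b = dfB.title.values.astype(str)
a = dfA.title.values.astype(str)[:, None]
# Broadcasting counts every search term in every title; sum(1) totals each row of dfA
dfA.assign(Results=np.char.count(b, a).sum(1))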

   title  Results
0  plane        1
1  house        2
2    car        1

Setup

import pandas as pd

dfA = pd.DataFrame(dict(title=['plane', 'house', 'car']))

dfB = pd.DataFrame(dict(
    title=['aero plane', 'household', 'luxury cars', 'house decorations']
))
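To make the broadcasting step less magical, here is a small sketch that exposes the intermediate matrix. The names terms, titles, hits and row_totals are mine, chosen for illustration only, and it uses np.char.count, the public alias for the function imported above:

```python
import numpy as np

terms = np.array(["plane", "house", "car"])                      # shape (3,)
titles = np.array(["aero plane", "household",
                   "luxury cars", "house decorations"])          # shape (4,)

# terms[:, None] has shape (3, 1); broadcasting it against shape (4,)
# yields a (3, 4) matrix where hits[i, j] counts terms[i] inside titles[j].
hits = np.char.count(titles, terms[:, None])

# Summing along axis 1 collapses the titles, leaving one total per search term.
row_totals = hits.sum(axis=1)
```

Every term is still compared with every title, so the work remains proportional to len(terms) * len(titles); the loop just runs inside NumPy instead of in Python.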
piRSquared

I would first try merging the dataframes:

df = pd.merge(dfA, dfB, on = "title")
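Note that a plain merge on title only pairs rows whose values are exactly equal, so with the sample data from the question it returns an empty frame rather than substring counts. If you want to stay with merge, one possible sketch is a cross join followed by a per-pair substring count. This assumes pandas 1.2 or later (for how="cross") and unique titles in dfA; the column names title_b and hits are made up for illustration:

```python
import pandas as pd

dfA = pd.DataFrame({"title": ["plane", "house", "car"]})
dfB = pd.DataFrame({"title": ["aero plane", "household",
                              "luxury cars", "house decorations"]})

# Exact-match merge: no value in dfA equals a value in dfB, so nothing is paired
exact = pd.merge(dfA, dfB, on="title")

# Cross join pairs every search term with every title (pandas >= 1.2)
pairs = dfA.merge(dfB, how="cross", suffixes=("", "_b"))

# Count the search term inside the title of each pair with plain str.count
pairs["hits"] = [b.count(a) for a, b in zip(pairs["title"], pairs["title_b"])]

# Total the hits per search term and map them back onto dfA
totals = pairs.groupby("title", sort=False)["hits"].sum()
dfA["Results"] = dfA["title"].map(totals)
```

The cross join materializes len(dfA) * len(dfB) rows, so it is only reasonable for small inputs.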
kjmerf