
I have two dataframes (A and B). I want to compare the strings in A and find those that exactly match, or are contained in, a string in B, then count the number of times each string in A is matched or contained in B.

    Dataframe A
 
0   "4012, 4065, 4682"
1   "4712, 2339, 5652, 10007"
2   "4618, 8987"
3   "7447, 4615, 4012"
4   "6515"
5   "4065, 2339, 4012"


    Dataframe B

0   "6515, 4012, 4618, 8987"         <- matches (DF A, Index 2 & 4) (2: 4618, 8987), (4: 6515)
1   "4065, 5116, 2339, 8757, 4012"   <- matches (DF A, Index 5) (4065, 2339, 4012)
2   "1101"
3   "6515"                           <- matches (DF A, Index 4) (6515)
4   "4012, 4615, 7447"               <- matches (DF A, Index 3) (7447, 4615, 4012)
5   "7447, 6515, 4012, 4615"         <- matches (DF A, Index 3 & 4) (3: 7447, 4615, 4012 ), (4: 6515)


    Desired Output:

    Itemset             Count

2   4618, 8987            1
3   7447, 4615, 4012      2
4   6515                  3
5   4065, 2339, 4012      1

Basically, I want to count when there is a direct match of A in B (in order or not), or when A is partially contained in B (in order or not). My goal is to count how many times A is validated by B. These are all strings, by the way.
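In other words, a row of A counts as validated when all of its items appear somewhere in a row of B, regardless of order. A minimal sketch of that idea with plain Python sets (toy data, not the full frames above):

```python
# Each comma-separated string becomes a set of items; order is irrelevant.
A = ["4012, 4065, 4682", "4618, 8987", "6515"]
B = ["6515, 4012, 4618, 8987", "6515", "4012, 4615, 7447"]

to_set = lambda s: set(s.split(", "))

# For each row of A, count the rows of B that contain all of its items.
counts = [sum(to_set(a) <= to_set(b) for b in B) for a in A]
print(counts)  # [0, 1, 2]
```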

Pysnek313
  • when matching your values I get at least 1 match for every row in df1, can you explain or re-check the output? – Umar.H Jun 22 '20 at 15:32
  • I changed the last number in DF2, Index 5 from "4615" to "6515" so that's my mistake. I need itemsets in column B to contain or exactly match the items in Column A. So for example, in (DF B, Index 1) I said it matched (DF A, Index 5). Even though (DF A, Index 0) contains almost the same items in (DF A, Index 5), every item in (DF A, Index 0) does not match (DF B, Index 1). But (DF A, Index 5) does match or contain every item in (DF B, Index 1). – Pysnek313 Jun 22 '20 at 16:02

3 Answers


You can try something like this, using `all`, `apply` and `sum`:

import pandas as pd
import io 

#Creating the dataframes
A='''
col
"4012, 4065, 4682"
"4712, 2339, 5652, 10007"
"4618, 8987"
"7447, 4615, 4012"
"6515"
"4065, 2339, 4012"
'''

dfA = pd.read_csv(io.StringIO(A), sep=r'\s\s+', engine='python')

B='''
col
"6515, 4012, 4618, 8987"         
"4065, 5116, 2339, 8757, 4012"   
"1101"
"6515"                           
"4012, 4615, 7447"               
"7447, 6515, 4012, 6515" 
'''
dfB = pd.read_csv(io.StringIO(B), sep=r'\s\s+', engine='python')


#Function to convert a comma-separated string into a list of ints
tolist1 = lambda x: [int(s.replace('"', '')) for s in x.split(',')]



def countapp(x):
    #Convert each string in dfB into a list, check whether every item of tolist1(x) is contained in it, and sum the matches
    return (dfB['col'].apply(tolist1).apply(lambda y: all(i in y for i in tolist1(x))).sum())

#Create the dataframe with the items:
counter=pd.DataFrame(data=dfA.values,columns=['Itemset'])

#Add the counts by applying countapp to each string of dfA
counter['Count']=dfA['col'].apply(countapp)

#Drop the rows with zero values
counter=counter[(counter['Count'] != 0)]

print(counter)

Output:

dfA
                         col
0         "4012, 4065, 4682"
1  "4712, 2339, 5652, 10007"
2               "4618, 8987"
3         "7447, 4615, 4012"
4                     "6515"
5         "4065, 2339, 4012"



dfB:
                          col
0        "6515, 4012, 4618, 8987"
1  "4065, 5116, 2339, 8757, 4012"
2                          "1101"
3                          "6515"
4              "4012, 4615, 7447"
5        "7447, 6515, 4012, 6515" 


counter with zero-value items:
                     Itemset  Count
0         "4012, 4065, 4682"      0
1  "4712, 2339, 5652, 10007"      0
2               "4618, 8987"      1
3         "7447, 4615, 4012"      1
4                     "6515"      3
5         "4065, 2339, 4012"      1


final counter:
              Itemset  Count
2        "4618, 8987"      1
3  "7447, 4615, 4012"      1
4              "6515"      3
5  "4065, 2339, 4012"      1
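As a side note (not part of the original answer), the same containment test can be written with sets: converting each row of dfB to a set once avoids the repeated list scans, and `<=` is the subset operator. A sketch on a small slice of the data:

```python
import pandas as pd

dfA = pd.DataFrame({'col': ["4618, 8987", "6515", "4065, 2339, 4012"]})
dfB = pd.DataFrame({'col': ["6515, 4012, 4618, 8987",
                            "4065, 5116, 2339, 8757, 4012",
                            "6515"]})

# Convert each dfB string into a set of items, once.
setsB = dfB['col'].apply(lambda x: set(x.split(', '))).tolist()

counter = pd.DataFrame({'Itemset': dfA['col']})
# A row matches when its item set is a subset (<=) of a dfB row's set.
counter['Count'] = dfA['col'].apply(
    lambda x: sum(set(x.split(', ')) <= b for b in setsB))
print(counter)  # Count column: 1, 2, 1
```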
MrNobody33

I'm rewriting this answer based on our discussions in the comments.

Rather than use `apply`, you can use a list comprehension to the same effect; the following creates a list with the desired count for each row:

[sum(all(val in cell for val in row) for cell in dfB['values_list']) for row in dfA['values_list']]

While I originally found this significantly harder to parse than an apply function (and much harder to write), there is a tremendous advantage in speed. Here is your data, with the final two lines to split entries into lists:

import pandas as pd

dfA = pd.DataFrame(["4012, 4065, 4682",
                    "4712, 2339, 5652, 10007",
                    "4618, 8987",
                    "7447, 4615, 4012",
                    "6515",
                    "4065, 2339, 4012",],
                    columns=['values'])

dfB = pd.DataFrame(["6515, 4012, 4618, 8987",
                    "4065, 5116, 2339, 8757, 4012",
                    "1101",
                    "6515",
                    "4012, 4615, 7447",
                    "7447, 6515, 4012, 4615"],
                    columns=['values'])

dfA['values_list'] = dfA['values'].str.split(', ')
dfB['values_list'] = dfB['values'].str.split(', ')

Here is a speed test using the gnarly list comp:

In[0]
%%timeit -n 1000
dfA['overlap_A'] = [sum(all(val in cell for val in row)
                    for cell in dfB['values_list']) 
                    for row in dfA['values_list']]

Out[0]
186 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

And here is the same using an `apply` function, similar to the one used in MrNobody33's answer, and in my original (derivative) answer. Note that this function already uses comprehensions internally, and presumably replacing them with for loops would make it slower:

def check_overlap(row):
    return sum(all(val in cell for val in row['values_list']) for cell in dfB['values_list'])

In[1]:
%%timeit -n 1000
dfA['overlap_B'] = dfA.apply(check_overlap, axis=1)

Out[1]:
1.4 ms ± 61.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That's about 7x faster without `apply`! Note that the resulting output is the same:

                    values                values_list  overlap_A  overlap_B
0         4012, 4065, 4682         [4012, 4065, 4682]          0          0
1  4712, 2339, 5652, 10007  [4712, 2339, 5652, 10007]          0          0
2               4618, 8987               [4618, 8987]          1          1
3         7447, 4615, 4012         [7447, 4615, 4012]          2          2
4                     6515                     [6515]          3          3
5         4065, 2339, 4012         [4065, 2339, 4012]          1          1
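If this needs to scale further, most of the remaining cost is the repeated `in` test against Python lists. Precomputing a frozenset per row turns each containment check into hash lookups; this is my own variation on the comprehension above (same column names, small sample data), not part of the original answer:

```python
import pandas as pd

dfA = pd.DataFrame({'values': ["4618, 8987", "6515"]})
dfB = pd.DataFrame({'values': ["6515, 4012, 4618, 8987", "6515", "1101"]})

# Build the sets once, outside the comprehension, so each subset
# check (<=) is done via hashing rather than a nested list scan.
sets_a = [frozenset(v.split(', ')) for v in dfA['values']]
sets_b = [frozenset(v.split(', ')) for v in dfB['values']]

dfA['overlap'] = [sum(a <= b for b in sets_b) for a in sets_a]
print(dfA['overlap'].tolist())  # [1, 2]
```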
Tom
  • Hi Tom, thanks for your reply. In response to your comment: "DFA index 3 is no longer in DF B index 5?", I made a mistake in my comment. I had the original dataframe formatted to how I wanted it, and I thought I had made a mistake when I read DataNovice's comment so I changed it. Looking over the dataframe again, I changed DF B, Index 5 back to how it was originally. So DF A index 3 should be included in DF B, Index 5. Again, sorry for the confusion. You mentioned you tried to not apply? I thought apply is pretty pythonic, no? or because you wanted to provide a different answer from Mr. nbdy? – Pysnek313 Jun 22 '20 at 17:39
  • I just edited with the data as I see on your post now, I believe it is giving the expected output. [And see this thread about `apply`](https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code), basically you get convenience at the price of speed. I still like `apply` and think it is very intuitive to use, but I think it's a "bad habit", especially when scaling up. Both @MrNobody33 and I are using `apply`. If you don't have lots of data, it should be fine – Tom Jun 22 '20 at 17:47
  • If you can't use `numpy` or `pandas` vectorized methods (which may be possible here, I don't know), one option to avoid `apply` is use list comprehensions instead, but I couldn't solve that here, and I think the answer may be fairly convoluted either way (more so than `apply`, I would guess). – Tom Jun 22 '20 at 17:50
  • Thank you for the link to the info about apply. I have to check DF A (400 ~ 500 transaction strings) against about 100K different transactions / rows. I don't mind the slowness as long as it works. This is for validation purposes and I'm going to scatterplot the returned count. It will only be run once a month. Would you consider this lots of data? – Pysnek313 Jun 22 '20 at 17:54
  • Honestly I'm not sure, but I reckon that could be enough to cause significant slow down. I would maybe try one of these approaches to see if it's suitable, and if not you can update/create a new post emphasizing speed/efficiency. – Tom Jun 22 '20 at 18:01
  • Hi @Pysnek313, could we discuss [here](https://chat.stackoverflow.com/rooms/216449/discusswithpysnek313andmrnobody) ,if it's not too much trouble, why you accepted this Tom's answer instead of mine? Please, it's just to clarify :). – MrNobody33 Jun 22 '20 at 18:13
  • Hey @Tom could you enter again to the chat room please :), sorry if I'm being annoying. – MrNobody33 Jun 22 '20 at 19:26
  • @Tom -- do you think it is possible to turn this code into numpy? I have millions of rows to check with this code and I'm experiencing some performance issues. – Pysnek313 Mar 12 '21 at 16:30

Here's the need-for-speed edition for those who need to go fast:

Compare two dataframe columns for matching strings or are substrings then count in pandas (Need for speed edition)

Courtesy of @bing-wang

Pysnek313