I have two dataframes (A and B). I want to compare strings in A and find a match or is contained in another string in B. Then count the amount of times A was matched or contained in B.
Dataframe A
0 "4012, 4065, 4682"
1 "4712, 2339, 5652, 10007"
2 "4618, 8987"
3 "7447, 4615, 4012"
4 "6515"
5 "4065, 2339, 4012"
Dataframe B
0 "6515, 4012, 4618, 8987" <- matches (DF A, Index 2 & 4) (2: 4618, 8987), (4: 6515)
1 "4065, 5116, 2339, 8757, 4012" <- matches (DF A, Index 5) (4065, 2339, 4012)
2 "1101"
3 "6515" <- matches (DF A, Index 4) (6515)
4 "4012, 4615, 7447" <- matches (DF A, Index 3) (7447, 4615, 4012)
5 "7447, 6515, 4012, 4615" <- matches (DF A, Index 3 & 4) (3: 7447, 4615, 4012 ), (4: 6515)
Desired Output:
Itemset Count
2 4618, 8987 1
3 7447, 4165, 4012 2
4 6515 3
5 4065, 2339, 4012 1
Basically, I want to count when there is a direct match of A in B (either in order or not) or if A is partially contained in B (in order or not). My goal is to count how many times A is being validated by B. These are all strings by the way.
EDIT Need for speed edition: This is a redo question from my previous post: Compare two dataframe columns for matching strings or are substrings then count in pandas
I have millions of rows in both dfA and dfB to make these comparisons against. In my previous post, the following code got the job done:
import pandas as pd
dfA = pd.DataFrame(["4012, 4065, 4682",
"4712, 2339, 5652, 10007",
"4618, 8987",
"7447, 4615, 4012",
"6515",
"4065, 2339, 4012",],
columns=['values'])
dfB = pd.DataFrame(["6515, 4012, 4618, 8987",
"4065, 5116, 2339, 8757, 4012",
"1101",
"6515",
"4012, 4615, 7447",
"7447, 6515, 4012, 4615"],
columns=['values'])
dfA['values_list'] = dfA['values'].str.split(', ')
dfB['values_list'] = dfB['values'].str.split(', ')
dfA['overlap_A'] = [sum(all(val in cell for val in row)
for cell in dfB['values_list'])
for row in dfA['values_list']]
However with the total amount of rows to check, I am experiencing a performance issue and need another way to check the frequency / counts. Seems like Numpy is needed in this case. this is about the extent of my numpy knowledge as I work primarily in pandas. Anyone have suggestions to make this faster?
dfA_array = dfA['values_list'].to_numpy()
dfB_array = dfB['values_list'].to_numpy()