I have 2 lists of Stack Overflow questions, group A and group B. Both have two columns, Id and Tag. e.g:
|Id |Tag
| -------- | --------------------------------------------
|2 |c#,winforms,type-conversion,decimal,opacity
For each question in group A, I need to find in group B all matched questions that have at least one overlapping tag the question in group A, independent of the position of tags. For example, these questions should all be matched questions:
|Id |Tag
|----------|---------------------------
|3 |c#
|4 |winforms,type-conversion
|5 |winforms,c#
My first thought was to convert the variable Tag into a set variable and merge using Pandas because set ignores position. However, it seems that Pandas doesn't allow a set variable to be the key variable. So I am now using for loop to search over group B. But it is extremely slow since I have 13 million observation in group B.
My question is: 1. Is there any other way in Python to merge by a column of collection and can tell the number of overlapping tags? 2. How to improve the efficiency of for loop search?