
I have two lists of Stack Overflow questions, group A and group B. Both have two columns, Id and Tag, e.g.:

|Id        |Tag
| -------- | --------------------------------------------
|2         |c#,winforms,type-conversion,decimal,opacity

For each question in group A, I need to find in group B all matched questions that have at least one overlapping tag with the question in group A, independent of the position of tags. For example, these questions should all be matched questions:

|Id        |Tag
|----------|---------------------------
|3         |c#
|4         |winforms,type-conversion
|5         |winforms,c#

My first thought was to convert the variable Tag into a set variable and merge using Pandas, because sets ignore position. However, it seems that Pandas doesn't allow a set variable to be the key variable. So I am now using a for loop to search over group B. But it is extremely slow, since I have 13 million observations in group B.

My questions are:

1. Is there any other way in Python to merge on a column of collections and count the number of overlapping tags?
2. How can I improve the efficiency of the for-loop search?
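For context, a minimal sketch of the naive nested-loop baseline described above (with the toy data from the question; the list-of-tuples layout is just for illustration). It is O(|A| × |B|), which is why it crawls at 13 million rows:

```python
# Each entry is (Id, set of tags); sets make the overlap check order-independent.
a = [(2, {'c#', 'winforms', 'type-conversion', 'decimal', 'opacity'})]
b = [(3, {'c#'}), (4, {'winforms', 'type-conversion'}), (5, {'winforms', 'c#'})]

matches = []
for id_a, tags_a in a:
    for id_b, tags_b in b:
        common = tags_a & tags_b  # set intersection ignores tag position
        if common:
            matches.append((id_a, id_b, len(common)))
```

Every question in B is scanned once per question in A, so the cost scales with the product of the two group sizes.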

cs95
Xiaomeng
  • How did you get 13 million stack overflow questions? The API would never let you do that fast enough. – cs95 Jul 11 '17 at 01:41
  • I downloaded from Stack Exchange data dumps @COLDSPEED – Xiaomeng Jul 11 '17 at 01:45
  • Okay. Next question. Is Tag a string of comma separated tags type or list of tags? – cs95 Jul 11 '17 at 01:48
  • The Tag is a string of bracket separated tags. But I can convert it into a list of tags using regex. I am now reading your answer and it seems very promising. Thank you very much! – Xiaomeng Jul 11 '17 at 02:43
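As an aside, the regex conversion mentioned in the last comment might look like this (a sketch, assuming the data dump's bracket-separated format looks like `<tag1><tag2>`):

```python
import re

raw = "<c#><winforms><type-conversion><decimal><opacity>"
# capture everything between each pair of angle brackets
tags = re.findall(r"<([^>]+)>", raw)
```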

2 Answers


This can be achieved using df.join and df.groupby.

This is the setup I'm working with:

df1 = pd.DataFrame({ 'Id' : [2], 'Tag' : [['c#', 'winforms', 'type-conversion', 'decimal', 'opacity']]}) 

   Id                                                Tag
0   2  [c#, winforms, type-conversion, decimal, opacity]

df2 = pd.DataFrame({ 'Id' : [3, 4, 5], 'Tag' : [['c#'], ['winforms', 'type-conversion'], ['winforms', 'c#']]})  

   Id                          Tag
0   3                         [c#]
1   4  [winforms, type-conversion]
2   5               [winforms, c#]

Let's flatten out the Tag column in both data frames:

In [2331]: from itertools import chain

In [2332]: def flatten(df):
      ...:     return pd.DataFrame({"Id": np.repeat(df.Id.values, df.Tag.str.len()),
      ...:                          "Tag": list(chain.from_iterable(df.Tag))})
      ...: 

In [2333]: df1 = flatten(df1)

In [2334]: df2 = flatten(df2)

In [2335]: df1.head()
Out[2335]: 
   Id              Tag
0   2               c#
1   2         winforms
2   2  type-conversion
3   2          decimal
4   2          opacity

And similarly for df2, which is also flattened.
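For completeness, here is what the flattened df2 comes out to (the same flatten function as above, reproduced so the snippet is self-contained):

```python
from itertools import chain

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'Id': [3, 4, 5],
                    'Tag': [['c#'], ['winforms', 'type-conversion'], ['winforms', 'c#']]})

def flatten(df):
    # Repeat each Id once per tag it carries, and concatenate all the tag lists.
    return pd.DataFrame({"Id": np.repeat(df.Id.values, df.Tag.str.len()),
                         "Tag": list(chain.from_iterable(df.Tag))})

flat2 = flatten(df2)
#    Id              Tag
# 0   3               c#
# 1   4         winforms
# 2   4  type-conversion
# 3   5         winforms
# 4   5               c#
```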

Now for the magic. We'll do a join on the Tag column, then group by the joined ID pairs to find the count of overlapping tags.

In [2337]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index()
Out[2337]: 
   Id_x  Id_y  Tag
0     2     3    1
1     2     4    2
2     2     5    2

The output shows each pair of matched IDs along with the number of overlapping tags. Pairs with no overlaps are filtered out by the groupby.

The df.count counts overlapping tags, and df.reset_index just prettifies the output, since groupby moves the grouped columns into the index, so we reset it.

To see the matching tags themselves, modify the above slightly:

In [2359]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y'])['Tag'].apply(list).reset_index()
Out[2359]: 
   Id_x  Id_y                          Tag
0     2     3                         [c#]
1     2     4  [winforms, type-conversion]
2     2     5               [c#, winforms]

To filter out single-tag overlaps, chain a df.query call onto the first expression:

In [2367]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index().query('Tag > 1')
Out[2367]: 
   Id_x  Id_y  Tag
1     2     4    2
2     2     5    2 
cs95
  • Is there any way to also show the overlapping tags? – Xiaomeng Jul 11 '17 at 02:44
  • @COLDSPEED Sorry I forgot to mention before. Can I limit the number of overlapping tags when do the join? Because there are too many questions with one or two overlapping tags and will make the output table too long. So I would like to limit to questions that have only 3 or more overlapping tags. Thanks! – Xiaomeng Jul 11 '17 at 03:50
  • @Xiaomeng Use `.query`, as I edited into my post. From here on, please refrain from piling on requests and details in the comments. It hinders readability for future readers. One query at a time please. If this solved your question, please mark it accepted and ask a new one. I will not answer any more queries in the comments. – cs95 Jul 11 '17 at 08:36
  • @COLDSPEED I have trouble flattening df2, it gives me error "cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'". I've tried to dtype=int64 when import df2 from csv file. But I still got the same error. However, df1 has no problem being flattened using your function. The only difference between df1 and df2 is that df2 has much more observations than df1. And the df2.Id column might have longer int. – Xiaomeng Jul 12 '17 at 08:02
  • @Xiaomeng Hmm, you can try an explicit cast, like `df2['Id'] = df2['Id'].astype(np.int32)`. – cs95 Jul 12 '17 at 08:47
  • @COLDSPEED Acutally ['Id'] is not the problematic column. Somehow df2.Tag.str.len() returns a float value such as 3.0, 4.0. df1.Tag.str.len() returns int value such as 3 and 4. When I tried to convert df2.Tag.str.len() into an int series, it gives me "can't convert non-finite values". This is where I got stuck now. – Xiaomeng Jul 12 '17 at 09:37
  • Really? Okay, just try `int(df2.Tag.str.len())` – cs95 Jul 12 '17 at 09:39
  • It won't work either: "TypeError: cannot convert the series to " – Xiaomeng Jul 12 '17 at 10:04
  • @Xiaomeng `len(df2.Tag)` at least? – cs95 Jul 12 '17 at 10:06
  • @Xiaomeng Judging by the error you get, your data frame is probably more complicated in the structure than the question posed here might suggest. If the solution in my last comment does not solve your problem, please ask a new question and make sure to provide a [mcve]. – cs95 Jul 12 '17 at 10:07
  • @COLDSPEED. It is because I have have one missing value in the raw data. However, I have a "Memory Error" when trying to merge the two flattened dataframe. Should I ask a new question about this? Or maybe we can move to a chat room to discuss? Thanks! – Xiaomeng Jul 13 '17 at 02:51
  • 1
    @Xiaomeng Definitwly ask a new question . I don't think I know how to solve a memory error! – cs95 Jul 13 '17 at 08:44
  • Step 1: List all tags.
  • Step 2: Create a binary representation of each question's tags, i.e. use bit 1 or 0 to represent whether it has or does not have each tag.
  • Step 3: To find any IDs sharing a tag, call a simple apply function to decode the binary representation.

In terms of processing speed, it should be all right. However, if the number of tags is too large, there might be memory issues. If you only need to find questions sharing a tag with one Id, I suggest you write a simple function and call df.apply. If you need to check many IDs and find questions with shared tags, the approach above will be better.
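The three steps above can be sketched like this (a minimal illustration on the question's toy data; instead of a per-row apply, a dot product of the two one-hot matrices gives the overlap counts directly):

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [2],
                    'Tag': [['c#', 'winforms', 'type-conversion', 'decimal', 'opacity']]})
df2 = pd.DataFrame({'Id': [3, 4, 5],
                    'Tag': [['c#'], ['winforms', 'type-conversion'], ['winforms', 'c#']]})

# Step 1: list all tags.
vocab = sorted(set(t for tags in pd.concat([df1.Tag, df2.Tag]) for t in tags))

# Step 2: binary (one-hot) representation -- one column per tag, 1 if present.
def one_hot(df):
    return pd.DataFrame([[int(t in tags) for t in vocab] for tags in df.Tag],
                        index=df.Id, columns=vocab)

a, b = one_hot(df1), one_hot(df2)

# Step 3: the matrix product counts overlapping tags for every Id pair at once.
overlap = a.values @ b.values.T  # shape (len(df1), len(df2))
```

With 10,000+ tags this matrix has thousands of columns, which is exactly the memory concern raised in the comments below; a sparse matrix would mitigate that.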

(Intended to leave it as comment, but not enough reputation... sigh)

White
  • There are at least 10 thousand tags. If I use a binary representation, I will have thousands of columns to indicate the tags of a question. Will that be efficient? – Xiaomeng Jul 11 '17 at 02:31
  • 1
    If it is over 10 thousands of tags, I think you should not use the methods I proposed above. I think the method purposed by Coldspeed will be much better. – White Jul 11 '17 at 04:07