
Consider the following DataFrame:

link    tags            views
/a      [tag_a, tag_b]  100
/b      [tag_a, tag_c]  200
/c      [tag_b, tag_c]  150

What would be an efficient way to 'groupby' items within a list in the tags column? For instance, if one were to find the cumulative views for each tag in the DataFrame above, the result would be:

tag     views
tag_a   300
tag_b   250
tag_c   350

So far, this is what I have come up with:

# get all unique tags
all_tags = list(set([item for sublist in df.tags.tolist() for item in sublist]))

# sum the views for each tag (scans the whole frame once per tag)
tag_views = {tag: df[df.tags.map(lambda x: tag in x)].views.sum() for tag in all_tags}

This approach is rather slow for a large dataset. Is there a more efficient way (perhaps using the built-in `groupby` function) of doing this?

  • Why do you turn `all_tags` back into a `list`, wouldn't a `set` be sufficient? And I would just use the generator form for its construction, `set(item ... in sublist)`, to avoid constructing the intermediate list. – AChampion Jan 19 '16 at 05:20
  • @AChampion: it is faster to iterate over a list than a set. Since `all_tags` is only used for iteration, I decided to use a list. Regarding using `set(item ... in sublist)`, I think it is slightly inefficient (although I might be misinterpreting your point). [This question](http://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python) talks about it in greater detail. – Rishabh Srivastava Jan 19 '16 at 05:30
  • I know this is minor, but you effectively have `set(list(item for sublist in df.tags.tolist() for item in sublist))`; you just don't need the inner `list`, i.e. `set(item for sublist in df.tags.tolist() for item in sublist)` is sufficient. And why do you believe iterating a `set` is slower than constructing a `list` and then iterating it? – AChampion Jan 19 '16 at 05:46
  • Ah, got it, thanks for the pointer. Re iterating a set: I have a very large collection of tags (~100k), so I figured that converting from a set to a list might make a substantial difference. Will try them both out and compare later tonight. – Rishabh Srivastava Jan 19 '16 at 05:57
  • For very large datasets you might want to look into apache spark and a cluster of machines. – AChampion Jan 19 '16 at 07:02

1 Answer


You could split the tags column into multiple rows and then groupby:

import pandas as pd

df = pd.DataFrame(...)  # the frame shown in the question

# one row per (link, tag) pair: expand each tags list into columns, then stack
tag = pd.DataFrame(df.tags.tolist()).stack()
tag.index = tag.index.droplevel(-1)  # re-align with the original row index
tag.name = 'tag'
df.join(tag).groupby('tag').sum()

Result:

       views
tag         
tag_a    300
tag_b    250
tag_c    350

This will not be very space efficient because of the join, especially for a high number of tags per link. For a small number of tags per link, I would be interested to hear how the timings compare.
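For reference, here is a self-contained sketch of the approach above using the sample data from the question (the frame construction is an assumption about how the example was built). On pandas 0.25+, the stack/join steps can also be replaced by `DataFrame.explode`, which was added well after this answer was written:

import pandas as pd

df = pd.DataFrame({
    'link': ['/a', '/b', '/c'],
    'tags': [['tag_a', 'tag_b'], ['tag_a', 'tag_c'], ['tag_b', 'tag_c']],
    'views': [100, 200, 150],
})

# split-and-stack as above
tag = pd.DataFrame(df.tags.tolist()).stack()
tag.index = tag.index.droplevel(-1)
tag.name = 'tag'
print(df.join(tag).groupby('tag').views.sum())

# equivalent one-liner on pandas 0.25+
print(df.explode('tags').groupby('tags').views.sum())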

Alternatively, use a multi-index:

import pandas as pd

df = pd.DataFrame(...)  # the frame shown in the question
all_tags = [...]        # the unique tags

# one boolean per tag per row: True if the row's tags contain that tag
groups = df.tags.map(lambda cell: tuple(tag in cell for tag in all_tags))
df.index = pd.MultiIndex.from_tuples(groups.values, names=all_tags)
for t in all_tags:
    # cross-section of the rows where this tag is present
    print(t, df.xs(True, level=t).views.sum())

Result:

tag_a 300
tag_b 250
tag_c 350
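Here `all_tags` is the same flat collection of unique tags built in the question, e.g. `set(item for sublist in df.tags for item in sublist)`. Note that `df.xs(True, level=t)` raises a `KeyError` for a tag that occurs in no row, which cannot happen when `all_tags` is derived from the data itself.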
AChampion
  • Works like a charm - thank you. You made a small typo in the answer; the last line should be `df.join(tag).groupby('tag').sum()`. I ended up using a different approach to solve my problem (created a new binary column for every tag, then did a matrix multiplication followed by a sum; a sketch of that approach follows these comments) - this was ~2.5x faster than my original attempt, but your approach is far more elegant. – Rishabh Srivastava Jan 19 '16 at 05:39
  • Thanks, fixed. The binary-column approach sounds similar to the multi-index approach. – AChampion Jan 19 '16 at 05:48
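For illustration, here is a minimal sketch of the binary-column approach described in the comment above (this reconstruction is an assumption based on the comment's description, not the asker's actual code):

import pandas as pd

df = pd.DataFrame({
    'link': ['/a', '/b', '/c'],
    'tags': [['tag_a', 'tag_b'], ['tag_a', 'tag_c'], ['tag_b', 'tag_c']],
    'views': [100, 200, 150],
})

all_tags = sorted(set(tag for tags in df.tags for tag in tags))

# one 0/1 indicator column per tag
indicators = pd.DataFrame(
    [[int(tag in tags) for tag in all_tags] for tags in df.tags],
    columns=all_tags, index=df.index,
)

# matrix multiplication: (n_tags x n_rows) @ (n_rows,) -> total views per tag
print(indicators.T.dot(df.views))

The indicator matrix costs O(n_rows * n_tags) memory, but the multiplication is a single vectorized pass, which is consistent with the ~2.5x speedup over the per-tag filtering loop reported above.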