
I have a column in a pandas dataframe which contains a large number of lists of labels:

>>> data['SPLIT'].head(10)
0    [33.23, 88.72, 38.93, E931.7, V09.0, 041.11, 5...
1    [99.04, 38.06, 39.57, 00.91, 55.69, V15.82, 27...
2    [96.04, 96.72, 401.9, 276.5, 584.9, 428.0, 507...
3    [96.6, 99.15, 99.83, V29.0, 765.15, 765.25, 77...
4    [96.71, 96.04, 54.12, 99.60, 38.93, 99.15, 53....
5    [88.72, 37.61, 39.61, 36.15, 36.12, 272.0, 401...
6    [38.93, 88.72, 37.31, 272.4, 719.46, 722.0, 31...
7    [88.72, 39.61, 35.71, 272.4, V12.59, 458.29, 7...
8    [97.44, 99.04, 88.56, 37.23, 39.95, 38.95, 00....
9    [00.14, 89.61, 39.95, E878.8, 244.9, 443.9, 18...

What I am trying to do is iterate through all of these lists and count the total occurrences of each value, so that I can find the 50 most frequently occurring values.

This is the code I have used, which runs extremely slowly:

test = pd.Series(sum([item for item in data.SPLIT], [])).value_counts()

I also tried writing a separate function to loop through the lists and count the values, but that was very slow as well.

Is there any way I can restructure this data, or a function in pandas I can use, that has performance similar to something like df.groupby().count()?

I searched Google and Stack Overflow for half an hour, but none of the answers performed any better. I have been trying for a long time to find a faster way to flatten the lists or map the counts (the dataframe has 500k rows, and the length of each list varies, from as short as 2 up to 512).
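For reference, here is a small synthetic example of the structure (the values below are made up; the real column has roughly 500k rows):

import pandas as pd

# Toy version of the column: each row holds a variable-length list of labels.
data = pd.DataFrame({
    "SPLIT": [
        ["33.23", "88.72", "38.93"],
        ["88.72", "401.9"],
        ["38.93", "88.72", "401.9", "88.72"],
    ],
})

# The slow approach from above, shown only to illustrate the desired output.
test = pd.Series(sum([item for item in data.SPLIT], [])).value_counts()
print(test.head(50))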

Ben C Wang
2 Answers


Use a list comprehension for flattening instead of sum:

test = pd.Series([x for item in data.SPLIT for x in item]).value_counts()
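The sum(..., []) version in the question is slow mainly because sum rebuilds the accumulated list on every concatenation, while the comprehension appends each element only once. Since value_counts already returns the counts sorted in descending order, the 50 most frequent values are then just:

top_50 = test.head(50)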

Or flatten with chain.from_iterable:

from itertools import chain

# chain.from_iterable lazily joins the row lists into one iterator,
# so only the final flattened list is ever materialised.
test = pd.Series(list(chain.from_iterable(data.SPLIT))).value_counts()

Or use collections.Counter:

from itertools import chain
from collections import Counter

test = pd.Series(Counter(chain.from_iterable(data.SPLIT)))
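One difference from the value_counts solutions: the Series built from Counter is not sorted by count, so sort it if you want the most frequent values first (a small addition on top of the line above):

test = pd.Series(Counter(chain.from_iterable(data.SPLIT))).sort_values(ascending=False)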

Or:

import functools, operator

# operator.iconcat extends the accumulator in place (like list.extend),
# so the growing list is not re-copied on every step as with sum(..., []).
test = pd.Series(functools.reduce(operator.iconcat, data.SPLIT, [])).value_counts()

Pure pandas solution:

# Build a wide, NaN-padded DataFrame (one column per list position), stack it, then count.
test = pd.DataFrame(data.SPLIT.values.tolist()).stack().value_counts()
jezrael
    Holy crap man, that first solution looks almost the same as what I wrote but it completed almost instantaneously. Thank you so much. I'm going figure out why yours is so much faster, I appreciate the guidance. – Ben C Wang Apr 14 '19 at 08:08
  • @BenCWang - Check [this](https://stackoverflow.com/a/45323085/2901002) - also added the fastest solution to answer. – jezrael Apr 14 '19 at 08:35

How about this?

import pandas as pd

# Expand each row's list into its own columns, one per position.
split = data["SPLIT"].apply(pd.Series)
split = split.rename(columns=lambda x: "val_" + str(x))
# Melt back to a single value column and count how often each value appears.
split.melt(value_name="val").groupby(["val"]).size()
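To then pull out the 50 most frequent values, one possible final step (not part of the original answer) is to sort the resulting counts:

counts = split.melt(value_name="val").groupby(["val"]).size()
top_50 = counts.sort_values(ascending=False).head(50)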
pythonjokeun