
Suppose I have a Pandas df

      col_name
1    [16, 4, 30]   
2    [5, 1, 2]   
3    [4, 5, 52, 888]
4    [1, 2, 4]
5    [5, 99, 4, 75, 1, 2]

I would like to remove all the elements in the whole column that appear fewer than x times. For example, let's take x = 3.

That means I would like the result to look like:

      col_name
1    [4]   
2    [5, 1, 2]   
3    [4, 5]
4    [1, 2, 4]
5    [5, 4, 1, 2]

The resulting df basically removes the numbers 16, 30, 52, 888, 99 and 75 because each of them appears fewer than 3 times in the column.

I tried using Counter from collections but it didn't work.

Really appreciate if you could give me any hints. Thanks in advance.

crucialoil

5 Answers


Option 1
A somewhat plain-vanilla approach.

import numpy as np
import pandas as pd

# Flatten the lists into a Series keyed by (row, position)
s = pd.Series({(i, j): x for (i, r) in df.col_name.items() for j, x in enumerate(r)})

# Integer code for each element, plus the array of unique values
f, u = pd.factorize(s.values)

# Keep elements whose value occurs at least 3 times, then regroup by row
s[(np.bincount(f) >= 3)[f]].groupby(level=0).apply(list).to_frame('col_name')

       col_name
0           [4]
1     [5, 1, 2]
2        [4, 5]
3     [1, 2, 4]
4  [5, 4, 1, 2]
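To see what the factorize/bincount combination is doing, here is a minimal standalone sketch on toy data (with a threshold of 2 rather than 3):

```python
import numpy as np
import pandas as pd

# Toy data: only 4 appears at least twice
values = np.array([16, 4, 30, 5, 1, 2, 4])

f, u = pd.factorize(values)   # f: integer code per element, u: unique values
counts = np.bincount(f)       # occurrences of each unique value
mask = (counts >= 2)[f]       # broadcast counts back onto every element

print(values[mask])           # [4 4]
```

Indexing the per-unique counts with `f` is what maps a per-value count back onto every individual element in one vectorised step.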

Option 2
An over-the-top complicated approach.

lens = df.col_name.str.len().values          # length of each list
splits = lens.cumsum()[:-1]                  # row boundaries in the flat array
values = np.concatenate(df.col_name.values)  # all values, flattened
f, u = pd.factorize(values)                  # integer codes + unique values
b = np.bincount(f)                           # count of each unique value
r = np.arange(len(df)).repeat(lens)          # row index of each flat value
m = (b >= 3)[f]                              # mask: value occurs at least 3 times
new_splits = splits - np.bincount(r, ~m).astype(int).cumsum()[:-1]  # shift boundaries by removals per row
new_values = np.split(values[m], new_splits)  # reassemble the filtered lists
df.assign(col_name=new_values)

       col_name
0           [4]
1     [5, 1, 2]
2        [4, 5]
3     [1, 2, 4]
4  [5, 4, 1, 2]
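The trickiest step above is recomputing the split points after elements are dropped. As a minimal sketch of the np.split mechanics it relies on (toy arrays, not the answer's data):

```python
import numpy as np

# Reassemble a flat array into rows using cumulative row lengths
lens = np.array([3, 2])
flat = np.array([10, 20, 30, 40, 50])
rows = np.split(flat, lens.cumsum()[:-1])

print([r.tolist() for r in rows])   # [[10, 20, 30], [40, 50]]
```

The answer adjusts those cumulative boundaries downward by the number of elements removed in each preceding row, so the split points stay aligned with the filtered array.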
piRSquared

First get the counts, then apply (or applymap) your conditional check on the elements.

In [2707]: counts = pd.Series([v for x in df.col_name for v in x]).value_counts()

In [2708]: df.col_name.apply(lambda x: [v for v in x if counts[v] >= 3])
Out[2708]:
1             [4]
2       [5, 1, 2]
3          [4, 5]
4       [1, 2, 4]
5    [5, 4, 1, 2]
Name: col_name, dtype: object

In [2709]: df.applymap(lambda x: [v for v in x if counts[v] >= 3])
Out[2709]:
       col_name
1           [4]
2     [5, 1, 2]
3        [4, 5]
4     [1, 2, 4]
5  [5, 4, 1, 2]

Details

In [2710]: counts
Out[2710]:
4      4
5      3
2      3
1      3
30     1
888    1
52     1
16     1
75     1
99     1
dtype: int64
Zero

You can use Counter() from collections:

import pandas as pd
from collections import Counter

limit = 3

df = pd.DataFrame({'col_name': [[16, 4, 30], [5, 1, 2], [4, 5, 52, 888], [1, 2, 4], [5, 99, 4, 75, 1, 2]]})

flat = Counter([y for x in df.col_name for y in x])
desired = [k for k, v in flat.items() if v >= limit]

df['col_name'] = df['col_name'].apply(lambda x: [i for i in x if i in desired])
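One small tweak worth noting: building desired as a set makes the per-element membership test O(1) instead of a linear scan. A self-contained sketch of the same approach (the `keep` name is mine):

```python
import pandas as pd
from collections import Counter

limit = 3
df = pd.DataFrame({'col_name': [[16, 4, 30], [5, 1, 2], [4, 5, 52, 888],
                                [1, 2, 4], [5, 99, 4, 75, 1, 2]]})

# Count every element across the column, keep only the frequent values
flat = Counter(y for x in df.col_name for y in x)
keep = {k for k, v in flat.items() if v >= limit}   # set -> O(1) lookups

df['col_name'] = df['col_name'].apply(lambda x: [i for i in x if i in keep])
print(df.col_name.tolist())   # [[4], [5, 1, 2], [4, 5], [1, 2, 4], [5, 4, 1, 2]]
```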
zipa

You can get all the values that meet the threshold with value_counts and boolean indexing:

from itertools import chain

a = pd.Series(list(chain.from_iterable(df['col_name']))).value_counts()
a = a.index[a >= 3]
print (a)
Int64Index([4, 5, 2, 1], dtype='int64')

df = df.applymap(lambda x: [v for v in x if v in a])
print (df)
       col_name
1           [4]
2     [5, 1, 2]
3        [4, 5]
4     [1, 2, 4]
5  [5, 4, 1, 2]
jezrael

Similar to this, using collections.Counter (but developed independently, with a few optimisations):

import numpy as np
from collections import Counter

# Count every value across the whole column
c = Counter(np.concatenate(df.col_name.tolist()))

def foo(array):
    return [x for x in array if c[x] >= 3]

df.col_name = df.col_name.apply(foo)
df

       col_name
1           [4]
2     [5, 1, 2]
3        [4, 5]
4     [1, 2, 4]
5  [5, 4, 1, 2]
cs95
  • 379,657
  • 97
  • 704
  • 746