
Intention: To filter binary numbers based on Hamming weight using pandas. Here I count the number of 1s in each number's binary representation and write the count to df.

Effort so far:

import pandas as pd
def ones(num):
    return bin(num).count('1')
num = list(range(1,8))
C = pd.Index(["num"])
df = pd.DataFrame(num, columns=C)
df['count'] = df.apply(lambda row : ones(row['num']), axis = 1)
print(df) 

output:

   num  count
0    1      1
1    2      1
2    3      2
3    4      1
4    5      2
5    6      2
6    7      3


Intended output:
   1  2  3
0  1  3  7
1  2  5
2  4  6

Help!

Programmer_nltk

3 Answers


You can use pivot_table. You'll need to define the index as the cumcount of the grouped count column, though, since pivot_table can't figure that out on its own :)

(df.pivot_table(index=df.groupby('count').cumcount(), 
                columns='count', 
                values='num'))

count    1    2    3
0      1.0  3.0  7.0
1      2.0  5.0  NaN
2      4.0  6.0  NaN

There is also a fill_value parameter, though I wouldn't recommend using it here, since you'd end up with mixed types. From here NumPy looks like a good option: you can easily obtain an array from the pivoted result with to_numpy().
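For instance (a sketch; the name new_df is an assumption, the pivoted result isn't bound to a variable above):

```python
import numpy as np
import pandas as pd

def ones(num):
    return bin(num).count('1')

df = pd.DataFrame({'num': range(1, 8)})
df['count'] = df['num'].apply(ones)

new_df = df.pivot_table(index=df.groupby('count').cumcount(),
                        columns='count',
                        values='num')

# float array; cells that were missing in the pivot become np.nan
arr = new_df.to_numpy()
```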


Also, focusing on the logic in ones, we can vectorise it with NumPy (based on this answer):

import numpy as np

# itemsize is the width in bytes, so this covers 8 bits, which is enough
# here since num < 256; use itemsize * 8 for the full bit width
m = df.num.to_numpy().itemsize
df['count'] = (df.num.to_numpy()[:, None] & (1 << np.arange(m)) > 0).view('i1').sum(1)
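A quick sanity check (a sketch reusing the ones helper from the question) that the bit-twiddling version agrees with the Python-level count:

```python
import numpy as np
import pandas as pd

def ones(num):
    return bin(num).count('1')

df = pd.DataFrame({'num': range(1, 8)})

m = df.num.to_numpy().itemsize  # bytes; covers 8 bits, enough for num < 256
vec = (df.num.to_numpy()[:, None] & (1 << np.arange(m)) > 0).view('i1').sum(1)

# both approaches should produce the same Hamming weights
assert (vec == df.num.apply(ones).to_numpy()).all()
```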

Here's a check on both approaches' performance:

df_large = pd.DataFrame({'num':np.random.randint(0,10,(10_000))})

def vect(df):
    m = df.num.to_numpy().itemsize
    (df.num.to_numpy()[:,None] & (1 << np.arange(m)) > 0).view('i1').sum(1)

%timeit vect(df_large)
# 340 µs ± 5.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df_large.apply(lambda row : ones(row['num']), axis = 1)
# 103 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
yatu

I suggest a different output:

df.groupby("count").agg(list)

which will give you

             num
count           
1      [1, 2, 4]
2      [3, 5, 6]
3            [7]

It's the same information in a slightly different format. In your original pivoted format the rows are meaningless and you have an undetermined number of columns; an undetermined number of rows is the more common shape, and I think you'll find it easier to work with going forward.

Or consider just creating a dictionary, as a DataFrame adds a lot of overhead here for no benefit:

df.groupby("count").agg(list).to_dict()["num"]

which gives you

{
    1: [1, 2, 4], 
    2: [3, 5, 6], 
    3: [7],
}
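Put together as a runnable sketch (the name by_weight is mine, not from the answer), the dictionary gives you a direct lookup of all numbers with a given Hamming weight:

```python
import pandas as pd

def ones(num):
    return bin(num).count('1')

df = pd.DataFrame({'num': range(1, 8)})
df['count'] = df['num'].apply(ones)

# map each Hamming weight to the list of numbers that have it
by_weight = df.groupby('count')['num'].agg(list).to_dict()

print(by_weight[2])  # [3, 5, 6]
```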
Dan
  • OP wants to group the numbers with the same numbers of `1`s in the binary representation. I don't think a pivot is the best data structure for the output, this is an alternative they might not have thought of – Dan Jun 26 '20 at 13:59
  • A df of lists is never a good idea if it can be avoided. Performance drops on large dataframes even with the simplest operations – yatu Jun 26 '20 at 14:00
  • tbh it should probably be a dictionary. A df doesn't make a lot of sense either way. – Dan Jun 26 '20 at 14:01

Here's one approach:

df.groupby('count')['num'].agg(list).apply(pd.Series).T
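In full (a sketch; the setup lines rebuild the question's df), this reproduces the intended layout, with NaN padding the shorter columns:

```python
import pandas as pd

def ones(num):
    return bin(num).count('1')

df = pd.DataFrame({'num': range(1, 8)})
df['count'] = df['num'].apply(ones)

# lists per weight -> one row per weight via pd.Series -> transpose
out = df.groupby('count')['num'].agg(list).apply(pd.Series).T
print(out)
# count    1    2    3
# 0      1.0  3.0  7.0
# 1      2.0  5.0  NaN
# 2      4.0  6.0  NaN
```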
Mark Wang