
I have two data sources I can join by a field and want to summarize them in a chart:

Data

The two DataFrames share column A:

ROWS = 1000
df = pd.DataFrame.from_dict({'A': np.arange(ROWS),
                             'B': np.random.randint(0, 60, size=ROWS),
                             'C': np.random.randint(0, 100, size=ROWS)})
df.head()
   A   B   C
0  0  10  11
1  1   7  64
2  2  22  12
3  3   1  67
4  4  34  57

And `other`, which I joined as follows:

other = pd.DataFrame.from_dict({'A': np.arange(ROWS),
                                'D': np.random.choice(['One', 'Two'], ROWS)})
other.set_index('A', inplace=True)
df = df.join(other, on=['A'], rsuffix='_right')
df.head()
   A   B   C    D
0  0  10  11  One
1  1   7  64  Two
2  2  22  12  One
3  3   1  67  Two
4  4  34  57  One

Question

What is a proper way to get a column chart with the count of:

  • C is GTE50 and D is One
  • C is GTE50 and D is Two
  • C is LT50 and D is One
  • C is LT50 and D is Two

Grouped by B, binned into 0, 1-10, 11-20, 21-30, 31-40, 41+.

Leonel Galán
  • `df = df.merge(other)` would have done just as well. – piRSquared May 02 '19 at 16:40
  • @piRSquared, but `df.merge(other)` only works because they share the index. For it to work with real data `A` would need to be the index in both DataFrames, right? – Leonel Galán May 02 '19 at 16:45
  • `merge` merges on columns. It can merge on indices if you tell it to, but by default on columns. If you don't specify which columns, it will choose columns that are common to both dataframes. In this case, only column `'A'` is common to both, and so it will merge on that. Please see [@cs95's canonical on merging](https://stackoverflow.com/q/53645882/2336654) – piRSquared May 02 '19 at 16:47
  • @piRSquared I'm glad I posted the two DataFrames, I thought about posting the resulting and starting my question there, but I would have miss this learning. Thanks for helping out! – Leonel Galán May 02 '19 at 16:49
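
A minimal sketch of the default behavior piRSquared describes, on toy frames (not the question's data): with no `on=` argument, `merge` joins on the column(s) the two frames share.

```python
import pandas as pd

left = pd.DataFrame({'A': [0, 1, 2], 'B': [10, 7, 22]})
right = pd.DataFrame({'A': [0, 1, 2], 'D': ['One', 'Two', 'One']})

# No `on=` given: merge picks the shared column 'A' automatically.
merged = left.merge(right)
print(merged.columns.tolist())  # ['A', 'B', 'D']
```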

4 Answers


IIUC, this can be dramatically simplified to a single `groupby`, taking advantage of `clip` and `np.ceil` to form your groups. A single `unstack` with two levels gives us the B-grouping as our x-axis, with bars for each D-C combination:

If you want slightly nicer labels, you can map the groupby values:

(df.groupby(['D', 
             df.C.ge(50).map({True: 'GE50', False: 'LT50'}),
             np.ceil(df.B.clip(lower=0, upper=41)/10).map({0: '0', 1: '1-10', 2: '11-20', 3: '21-30', 4: '31-40', 5: '41+'})
            ])
     .size().unstack([0,1]).plot.bar())

(Bar chart: counts grouped by binned B, one bar per D-C combination)


It is also equivalent to grouping B with:

pd.cut(df['B'],
       bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
       right=False,
       labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])
ALollz
  • I've learned so much from your post: I'm going to stick to `pd.cut`; it's built for that. Using `np.ceil` and `clip` is very clever, almost too clever. I also like how you used `DataFrameGroupBy.size`. I'll give more time for other answers to come, but this is very good! – Leonel Galán May 02 '19 at 20:10
  • Yeah, clip and floor/ceil/integer division can be nice from time to time, but are fairly particular. I prefer `size` when you just want the group size, though it is different from `count`, which will not include `null` values in the counts for columns. – ALollz May 02 '19 at 20:13
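
A toy illustration of that `size` vs `count` distinction (made-up frame, not the question's data): `size` returns the number of rows per group, while `count` tallies only non-null values.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, np.nan, 3.0]})

sizes = t.groupby('g').size()         # rows per group, NaN rows included
counts = t.groupby('g')['v'].count()  # non-null values only

print(int(sizes['a']), int(counts['a']))  # 2 1
```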

I arrived at this solution after days of grinding, going back and forth, but there are many things I consider code smells:

  • `groupby` returns a sort-of pivot table, and `melt`'s purpose is to unpivot data.
  • The use of dummies for Cx, but not for D? Ultimately they are both categorical data with two options. After two days, when I got this first solution, I needed a break before trying another branch that treats these two equally.
  • `reset_index`, only to `set_index` a few lines later; having to `sort_values` before `set_index`.
  • That last `summary.unstack().unstack()` reads like a big hack.
# %% Cx
df['Cx'] = df['C'].apply(lambda x: 'LT50' if x < 50 else 'GTE50')
df.head()

# %% Bins
df['B_binned'] = pd.cut(df['B'],
                        bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
                        right=False,
                        labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])
df.head()

# %% Dummies
s = df['D']
dummies = pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
df = pd.concat([df, dummies], axis=1)

df.head()

# %% Summary
summary = df.groupby(['B_binned', 'Cx']).agg({'One': 'sum', 'Two': 'sum'})

summary.reset_index(inplace=True)
summary = pd.melt(summary,
                  id_vars=['B_binned', 'Cx'],
                  value_vars=['One', 'Two'],
                  var_name='D',
                  value_name='count')
summary.sort_values(['B_binned', 'D', 'Cx'], inplace=True)
summary.set_index(['B_binned', 'D', 'Cx'], inplace=True)
summary

# %% Chart
summary.unstack().unstack().plot(kind='bar')

Bar Chart

Leonel Galán
  • You might want to consider using `np.where` instead of your current `apply` implementation, and also looking for an alternative to `apply(pd.Series)` if you want an improvement in performance here – yatu May 02 '19 at 16:13
  • @yatu, thanks for the pointers. I want both more readability and better performance, so I'll look into those. I appreciate the help! – Leonel Galán May 02 '19 at 16:17
  • Something like `df.groupby(['B_binned', 'Cx','D'])['D'].count().unstack([1,2])` – BENY May 02 '19 at 16:19
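
A runnable sketch of that comment's suggestion, on stand-in data with the same `B_binned`/`Cx`/`D` columns (the values here are random, not the question's): one three-key `groupby` plus a two-level `unstack` yields the pivoted summary directly, skipping the dummies/melt/sort/set_index round trip.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = pd.DataFrame({'B': rng.integers(0, 60, 100),
                  'C': rng.integers(0, 100, 100),
                  'D': rng.choice(['One', 'Two'], 100)})
t['Cx'] = np.where(t['C'] < 50, 'LT50', 'GTE50')
t['B_binned'] = pd.cut(t['B'],
                       bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
                       right=False,
                       labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])

# One groupby, a count, and a two-level unstack replace the melt/sort/set_index dance.
summary = t.groupby(['B_binned', 'Cx', 'D'])['D'].count().unstack([1, 2])

# Every row lands in exactly one cell, so the counts total the row count.
print(int(summary.fillna(0).to_numpy().sum()))  # 100
```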

Numpy

Using numpy arrays to count, then constructing the DataFrame to plot:

labels = np.array(['0', '1-10', '11-20', '21-30', '31-40', '41+'])
ge_lbl = np.array(['LT50', 'GE50'])  # order matches ltge below: 0 -> C < 50, 1 -> C >= 50

u, d = np.unique(df.D.values, return_inverse=True)
# side='right' so a value equal to a boundary falls into the upper bin (matching right=False cuts)
bins = np.array([1, 11, 21, 31, 41]).searchsorted(df.B, side='right')
ltge = (df.C.values >= 50).astype(int)

shape = (len(u), len(labels), len(ge_lbl))
out = np.zeros(shape, int)
np.add.at(out, (d, bins, ltge), 1)

pd.concat({
    d_: pd.DataFrame(o, labels, ge_lbl)
    for d_, o in zip(u, out)
}, names=['D', 'Cx'], axis=1).plot.bar()

(Bar chart of the counts, built from the Numpy-based DataFrame)

piRSquared
  • Thanks for providing an alternative solution. I'll consider this an advanced introduction to **Numpy**. It's a little harder to read than a solution using more **pandas**, but there's lots to learn from it! – Leonel Galán May 03 '19 at 14:39
  • Absolutely. I'd not advise this over a more user-friendly Pandas solution. – piRSquared May 03 '19 at 14:40

Tried a different way of doing it.

df['Bins'] = np.where(df['B'].isin([0]), '0',
            np.where(df['B'].isin(range(1,11)), '1-10',
            np.where(df['B'].isin(range(11,21)), '11-20',
            np.where(df['B'].isin(range(21,31)), '21-30',
            np.where(df['B'].isin(range(31,41)), '31-40','41+')
            ))))

df['Class_type'] = np.where(((df['C']>=50) & (df['D']== 'One') ), 'C is GTE50 and D is One',
            np.where(((df['C']>=50) & (df['D']== 'Two')) , 'C is GTE50 and D is Two',
            np.where(((df['C']<50) & (df['D']== 'One') ), 'C is LT50 and D is One',
                     'C is LT50 and D is Two')
            ))


# The question asks for counts, so count rather than sum
df.groupby(['Bins', 'Class_type'])['C'].count().unstack().plot(kind='bar')
plt.show()

#### Output ####

(Bar chart of the grouped counts)

WARNING: Not sure how optimal the solution is. It also consumes extra space, so space complexity may increase.

Preetham
  • In the case of many conditions, you can use [`np.select`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.select.html) to improve readability over several nested `np.where`. Also, since the values are numeric, I would use syntax like `df.B.between(1, 10)` as the condition – ALollz May 02 '19 at 19:08
  • @Preetham I like the idea of being explicit about the "class" and having it on a column. @ALollz, `np.select` seems much better than the nested `np.where`. – Leonel Galán May 02 '19 at 20:18
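
A sketch of what that `np.select` + `between` rewrite could look like (toy values; the labels mirror the answer's bins): `np.select` evaluates the conditions top to bottom, and anything unmatched falls through to the default.

```python
import numpy as np
import pandas as pd

b = pd.Series([0, 5, 15, 25, 35, 50])

conditions = [b.eq(0),
              b.between(1, 10),
              b.between(11, 20),
              b.between(21, 30),
              b.between(31, 40)]
labels = ['0', '1-10', '11-20', '21-30', '31-40']

# First matching condition wins; unmatched values get the default.
bins = np.select(conditions, labels, default='41+')
print(list(bins))  # ['0', '1-10', '11-20', '21-30', '31-40', '41+']
```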