
I have a dataframe where values have been assigned to groups:

import pandas as pd

df = pd.DataFrame({'num': [0.43, 5.2, 1.3, 0.33, 0.74, 0.5, 0.2, 0.12],
                   'group': [1, 2, 2, 2, 3, 4, 5, 5]})

df

   group   num
0      1  0.43
1      2  5.20
2      2  1.30
3      2  0.33
4      3  0.74
5      4  0.50
6      5  0.20
7      5  0.12

I would like to ensure that no value is in a group alone. If a value is an "orphan", it should be reassigned to the next highest group with more than one member. So the resultant dataframe should look like this instead:

   group   num
0      2  0.43
1      2  5.20
2      2  1.30
3      2  0.33
4      5  0.74
5      5  0.50
6      5  0.20
7      5  0.12

What's the most pythonic way to achieve this result?

edge-case

2 Answers


Here is one solution I found; there may be much better ways to do this:

import bisect

# Find the orphans (groups with exactly one member)
count = df.group.value_counts().sort_index()
orphans = count[count == 1].index.values.tolist()

# Find the populated groups (more than one member)
sets = count[count > 1].index.values.tolist()

# Find where orphans should be remapped
where = [bisect.bisect(sets, x) for x in orphans]
remap = [sets[x] for x in where]

# Create a dictionary for remapping, and replace original values
change = dict(zip(orphans, remap))
df = df.replace({'group': change})

df

   group   num
0      2  0.43
1      2  5.20
2      2  1.30
3      2  0.33
4      5  0.74
5      5  0.50
6      5  0.20
7      5  0.12
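
One caveat with the bisect lookup: if the largest group label is itself an orphan, bisect.bisect returns len(sets) and sets[x] raises an IndexError. A minimal guard clamps such orphans down to the last populated group (that fallback is an assumption on my part; the question only defines reassignment to the next highest group):

# Clamp the insertion point so an orphan beyond the largest populated
# group falls back to that group (assumed fallback, not from the question)
where = [min(bisect.bisect(sets, x), len(sets) - 1) for x in orphans]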
edge-case
  • Side note: I recommend `pd.Series.map` instead of `pd.Series.replace` for mapping a column via a dictionary. See [Replace values in a pandas series via dictionary efficiently](https://stackoverflow.com/questions/49259580/replace-values-in-a-pandas-series-via-dictionary-efficiently). – jpp Jul 26 '18 at 11:02
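
Picking up that side note: a minimal sketch of the map variant for the first answer, reusing the change dictionary built there (groups absent from the dictionary map to NaN, so they are filled back in from the original column):

# Remap orphans via the dictionary; untouched groups come back as NaN,
# so restore them from the original column and recover the int dtype
df['group'] = df['group'].map(change).fillna(df['group']).astype(int)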

It is possible to use only vectorised operations for this task. You can use pd.Series.bfill to create a mapping from the original group labels to new ones:

import numpy as np

# After reset_index, 'index' holds the group labels and 'group' their counts
counts = df['group'].value_counts().sort_index().reset_index()
counts['original'] = counts['index']

# Blank out orphan labels, then backfill from the next populated group
counts.loc[counts['group'] == 1, 'index'] = np.nan
counts['index'] = counts['index'].bfill().astype(int)

print(counts)

   index  group  original
0      2      1         1
1      2      3         2
2      5      1         3
3      5      1         4
4      5      2         5

Then use pd.Series.map to perform your mapping:

df['group'] = df['group'].map(counts.set_index('original')['index'])

print(df)

   group   num
0      2  0.43
1      2  5.20
2      2  1.30
3      2  0.33
4      5  0.74
5      5  0.50
6      5  0.20
7      5  0.12
jpp
  • This is really interesting, I did not know about `pd.Series.bfill()`. However, I find that this method is a bit slower. timeit results: `5.01 ms ± 31.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)` compared to the method with lists: `1.68 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)`. Are there other reasons why using only vectorized operations would be better? – edge-case Jul 26 '18 at 13:56
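
For anyone who wants to reproduce that comparison, here is a rough timeit sketch of the two approaches (timings vary by machine and pandas version; the bfill step is condensed to sidestep the reset_index column names, which changed across pandas releases):

import bisect
import timeit

import pandas as pd

df = pd.DataFrame({'num': [0.43, 5.2, 1.3, 0.33, 0.74, 0.5, 0.2, 0.12],
                   'group': [1, 2, 2, 2, 3, 4, 5, 5]})

def list_based(df):
    # First answer: bisect over plain Python lists
    count = df.group.value_counts().sort_index()
    orphans = count[count == 1].index.values.tolist()
    sets = count[count > 1].index.values.tolist()
    change = {x: sets[bisect.bisect(sets, x)] for x in orphans}
    return df.replace({'group': change})

def bfill_based(df):
    # Second answer, condensed: blank out orphan labels, backfill, then map
    counts = df['group'].value_counts().sort_index()
    mapping = counts.index.to_series().where(counts > 1).bfill().astype(int)
    return df.assign(group=df['group'].map(mapping))

print(timeit.timeit(lambda: list_based(df), number=1000))
print(timeit.timeit(lambda: bfill_based(df), number=1000))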