1

I have a dataframe column with numbering like this:

| week     |
| -------- |
| 10       |
| 11       |
| 12       |
| 14       |
| 15       |
| 16       |
| 18       |
| 19       |
| 20       |
| 21       |

and I wanna get this (a new column 'nums' with sequential numbering on contiguous blocks of 'week'):

| week     | nums           |
| -------- | -------------- |
| 10       | 0              |
| 11       | 1              |
| 12       | 2              |
| 14       | 0              |
| 15       | 1              |
| 16       | 2              |
| 18       | 0              |
| 19       | 1              |
| 20       | 2              |
| 21       | 3              |
| ...      | ...            |

Any ideas how to realize it as fast as possible without iteration because it's over a million rows.

And then I need a) just to count each that block of nums column and b) calculate median of 'nums' for each block

smci
  • 32,567
  • 20
  • 113
  • 146
Tyrell
  • 11
  • 2
  • A more accurate title is *"add column with sequential numbering for contiguous weeks"*. – smci Jul 11 '21 at 20:53
  • "without iteration because it's over million strings" Surely the dtype of 'week' is int, not string? If not, change the dtype in your pd.read_csv(). Likewise change all your dtypes to read int/float in as int/float, not string. Tell us how the performance improved. – smci Jul 11 '21 at 20:53
  • What does *"a) just to count each that block of nums column and b) calculate median of each block"* mean in your example? 3 blocks? Do you want the median of 'week' or 'nums' for each block? – smci Jul 11 '21 at 20:57
  • Sure, dtype is int. RIght, median of 'nums' – Tyrell Jul 11 '21 at 21:13
  • An old near-duplicate: [Counting consecutive positive values in Python/pandas array](https://stackoverflow.com/questions/27626542/counting-consecutive-positive-values-in-python-pandas-array) – smci Jul 11 '21 at 21:23
  • Tyrell: if you only want the median of 'nums' it will simply be (length of block)/2. You can get that directly with `df.groupby((df['week'].diff().fillna(0) != 1).cumsum()).size()/2`. You don't need to compute the numbering for each row. – smci Jul 11 '21 at 22:10
  • *"sequence of numbers from 0 to X"* is simply called *consecutive numbering*. And we don't even need to define X, we can simply say *consecutive numbering within groups*. – smci Jul 11 '21 at 23:06

1 Answers1

1

Let's try groupby cumcount on groups created from where the difference between rows does not equal 1:

df['nums'] = df.groupby(df['week'].diff().ne(1).cumsum()).cumcount()

df:

   week  nums
0    10     0
1    11     1
2    12     2
3    14     0
4    15     1
5    16     2
6    18     0
7    19     1
8    20     2
9    21     3

The Series can be reused and count and median can be calculated with groupby transform:

s = df['week'].diff().ne(1).cumsum()
df['nums'] = df.groupby(s).cumcount()
# Group and transform based on nums (save the grouper for re-use)
g = df.groupby(s)['nums']
df['count'] = g.transform('count')
df['median'] = g.transform('median')
   week  nums  count  median
0    10     0      3     1.0
1    11     1      3     1.0
2    12     2      3     1.0
3    14     0      3     1.0
4    15     1      3     1.0
5    16     2      3     1.0
6    18     0      4     1.5
7    19     1      4     1.5
8    20     2      4     1.5
9    21     3      4     1.5

Groups created with diff:

df['week'].diff().ne(1).cumsum()
0    1
1    1
2    1
3    2
4    2
5    2
6    3
7    3
8    3
9    3
Name: week, dtype: int32
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57