0

I have panel data (repeated observations per ID at different points in time). Data is unbalanced (there are gaps). I need to check and possibly adjust for a change in variable per person over the years.

I tried two versions. First, a for loop-setting, to first access each person and each of its years. Second, a one line combination with groupby. Groupby looks more elegant to me. Here the main issue is to identify the "next element". I assume in a loop I can solve this with a counter.

Here is my MWE panel data:

import pandas as pd
df = pd.DataFrame({'year': ['2003', '2004', '2005', '2006', '2007', '2008', '2009','2003', '2004', '2005', '2006', '2007', '2008', '2009'],
                   'id': ['1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2'],
                   'money': ['15', '15', '15', '16', '16', '16', '16', '17', '17', '17', '18', '17', '17', '17']}).astype(int)
df

Here is what a time series per person looks like:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

fig, ax = plt.subplots()

for i in df.id.unique():
    df[df['id']==i].plot.line(x='year', y='var', ax=ax, label='id = %s'%i)
    df[df['id']==i].plot.scatter(x='year', y='var', ax=ax)
    plt.xticks(np.unique(df.year),rotation=45)    

enter image description here

Here is what I want to achieve: For each person, compare the time series of values and drop every successor who is different from its precursor value (identify red circles). Then I will try different strategies to handle it:

  • Drop (very iffy): if successor differs, drop it
  • Smooth (absolute value): if successor differs by (say) 1 unit, assign it its precursor value
  • Smooth (relative value): if successor differs by (say) 1 percent, assign it its precursor value

Solution to drop

df['money_difference'] = df['money']-df.groupby('id')['money'].shift(1)
df_new = df.drop(df[df['money_difference'].abs()>0].index)

Idea to smooth

# keep track of change of variable by person and time
df['money_difference'] = df['money']-df.groupby('id')['money'].shift(1)
# first element has no precursor, it will be NaN, replace this by 0
df = df.fillna(0)
# now: whenever change_of_variable exceeds a threshold, replace the value by its precursor - not working so far
df['money'] = np.where(abs(df['money_difference'])>=1, df['money'].shift(1), df['money'])
Marco
  • 2,368
  • 6
  • 22
  • 48

1 Answers1

1

To get the next event in your database you can use a combination with groupby and shift and then do the subraction to the previos event:

df['money_difference'] =df.groupby(['year', 'id'])['money'].shift(-1)-df['money']
PV8
  • 5,799
  • 7
  • 43
  • 87
  • That's a good function. Wasn't aware of `shift`. But I changed the year component, since it gave me only NAN when I compare for all person-year-combination (there is only 1 of these maximum) `df['money_difference'] = df.groupby('id')['money'].shift(-1)-df['money']` Still also differences are reported "one year too early" and I get NAN for the last (because it has no succesor, which is ok I guess). – Marco Jun 18 '19 at 06:47
  • either your last or your first entry will be empty, i think shift gives your more possibilites, I also learned shift here – PV8 Jun 18 '19 at 06:48
  • Can you also give me a hint how to include a `replace` function when I smooth the data instead of dropping it? Something like this `df['money'].replace(df['money_difference'].abs()>1, df['money'].shift(1))` – Marco Jun 18 '19 at 09:18
  • What part you want to smooth? Do you need an if condition for your column? this can be done with np.where, in the ocmments here is a good example: https://stackoverflow.com/questions/52270185/while-loop-alternative-in-python – PV8 Jun 18 '19 at 09:23
  • I will check it out. In the MWE data, picture ID=2 I need the option to smooth the peak at year=2006. It might be considered an outlier, misreporting or something similar. So when the deviation is small, I would like to replace it with its precursor value. – Marco Jun 18 '19 at 09:29
  • so for example: `df['money'] = np.where(abs(df['money_difference'])>=1, df.groupby(['year', 'id'])['money'].shift(-1), df['money']) ` – PV8 Jun 18 '19 at 09:36
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/195123/discussion-between-marco-doe-and-pv8). – Marco Jun 18 '19 at 11:57