
I have a Pandas dataframe similar to

import pandas as pd

d = {'id': [1, 2, 2, 3], 'year': [2010, 2010, 2011, 2010], 'type': ['A', 'B', 'B', 'A'], 'value': [20, 2, 8, 3]}
df = pd.DataFrame(data=d)

That is

   id  year type  value
0   1  2010    A     20
1   2  2010    B      2
2   2  2011    B      8
3   3  2010    A      3

I want to add a new column which contains the value one year later for the same id and type, if it exists in the df (0 otherwise). That is, the expected result is

   id  year type  value  new_value
0   1  2010    A     20          0
1   2  2010    B      2          8
2   2  2011    B      8          0
3   3  2010    A      3          0

I cannot figure out a way of doing that (I have been experimenting mostly with apply). Any suggestions?

k88074
  • Does this answer your question? [Pandas conditional creation of a series/dataframe column](https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column) – bcosta12 Jan 31 '20 at 14:53
  • I don't think so: I cannot state the logical conditions a priori, because they depend on the specific row. That is, for each row I have to find another row with the same id and type, if it exists. – k88074 Jan 31 '20 at 14:59

3 Answers


You can try merge:

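# Shift the years back by one and merge on id, year and type,
# so that each row picks up the value of the following year.
(df.merge(df.assign(year=df['year']-1)[['id','year','type','value']],
          on=['id','year','type'],
          how='left',
          suffixes=('','_y'))
   .fillna(0)
)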

Output:

   id  year type  value  value_y
0   1  2010    A     20      0.0
1   2  2010    B      2      8.0
2   2  2011    B      8      0.0
3   3  2010    A      3      0.0
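
If the result should be an integer column named new_value, as in the question, the same self-merge idea can be adapted as in the minimal sketch below. It assumes each (id, type, year) combination appears at most once in the frame; the helper name `add_next_year_value` is hypothetical and not part of pandas.

```python
import pandas as pd


def add_next_year_value(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a new_value column (hypothetical helper).

    new_value holds the value recorded one year later for the same id and
    type, or 0 when no such row exists.
    """
    # Shift years back by one so the row for year N+1 lines up with year N.
    lookup = (
        df[['id', 'type', 'year', 'value']]
        .assign(year=lambda d: d['year'] - 1)
        .rename(columns={'value': 'new_value'})
    )
    out = df.merge(lookup, on=['id', 'type', 'year'], how='left')
    # Unmatched rows are NaN after the left merge; fill them and restore ints.
    out['new_value'] = out['new_value'].fillna(0).astype(int)
    return out


df = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'year': [2010, 2010, 2011, 2010],
    'type': ['A', 'B', 'B', 'A'],
    'value': [20, 2, 8, 3],
})
result = add_next_year_value(df)
print(result)
```

Renaming the column before the merge avoids the suffix handling, and filling only new_value leaves any missing values in other columns untouched.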
Quang Hoang

You can use the method shift:

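# First, sort by id, type and year (the example is already sorted)
df = df.sort_values(['id', 'type', 'year'])
cols = ['id', 'type']

# The following row counts only if it has the same id and type
# and is exactly one year later
same_group = (df[cols].shift(-1) == df[cols]).all(axis=1)
next_year = df['year'].shift(-1) == df['year'] + 1
df['value'].shift(-1, fill_value=0).where(same_group & next_year, 0)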

Output:

0    0
1    8
2    0
3    0
Name: value, dtype: int64
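
A related variant, sketched here under the assumption that each (id, type, year) combination is unique, shifts within groups instead of comparing adjacent rows by hand. The sample data is hypothetical and differs from the question: id 2 skips 2011, to show that a gap in years yields 0.

```python
import pandas as pd

# Hypothetical sample: id 2 has no 2011 row, id 3 has consecutive years.
df = pd.DataFrame({
    'id': [1, 2, 2, 3, 3],
    'year': [2010, 2010, 2012, 2010, 2011],
    'type': ['A', 'B', 'B', 'A', 'A'],
    'value': [20, 2, 8, 3, 5],
})

df = df.sort_values(['id', 'type', 'year'])
grouped = df.groupby(['id', 'type'])

# Within each (id, type) group, look at the following row.
next_value = grouped['value'].shift(-1)
next_year = grouped['year'].shift(-1)

# Keep the following value only when it is exactly one year later.
is_next_year = next_year == df['year'] + 1
df['new_value'] = next_value.where(is_next_year, 0).astype(int)
print(df)
```

Grouping keeps the shift from crossing group boundaries, so only the year comparison has to be checked explicitly.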
Mykola Zotko

Here is another solution using a dictionary.

# Creating a key column
df['key'] = df[['id','year','type']].astype(str).sum(axis=1)
print(df)
       id  year type  value     key
    0   1  2010    A     20  12010A
    1   2  2010    B      2  22010B
    2   2  2011    B      8  22011B
    3   3  2010    A      3  32010A

Now create a dictionary that maps each key to its value.

# Creating a dictionary
dict_of_columns = dict(zip(df.key, df.value))
print(dict_of_columns)
    {'12010A': 20, '22010B': 2, '22011B': 8, '32010A': 3}

Next, create a next_year column by adding 1 to every year, and build the corresponding lookup key from id, next_year and type. The lookup key is stored in the new_value column for now.

df['next_year']=df['year'] + 1
df['new_value'] = df[['id','next_year','type']].astype(str).sum(axis=1)
print(df)
       id  year type  value     key  next_year new_value
    0   1  2010    A     20  12010A       2011    12011A
    1   2  2010    B      2  22010B       2011    22011B
    2   2  2011    B      8  22011B       2012    22012B
    3   3  2010    A      3  32010A       2011    32011A

Finally, map the lookup keys in new_value through the dictionary, fill the keys that have no match with 0, and drop the helper columns.

df['new_value'] = df['new_value'].map(dict_of_columns).fillna(0)
df = df.drop(['key','next_year'],axis=1)
print(df)
       id  year type  value  new_value
    0   1  2010    A     20        0.0
    1   2  2010    B      2        8.0
    2   2  2011    B      8        0.0
    3   3  2010    A      3        0.0
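
Concatenated string keys can in principle collide (for example, id 1 with year 12010 and id 11 with year 2010 both give '112010A'). One way around this, shown as a minimal sketch rather than a drop-in replacement, is to use (id, year, type) tuples as dictionary keys; the name `value_by_key` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'year': [2010, 2010, 2011, 2010],
    'type': ['A', 'B', 'B', 'A'],
    'value': [20, 2, 8, 3],
})

# Map each (id, year, type) tuple to its value.
value_by_key = dict(zip(zip(df['id'], df['year'], df['type']), df['value']))

# Look up the key for the following year, defaulting to 0 when it is absent.
df['new_value'] = [
    value_by_key.get((i, y + 1, t), 0)
    for i, y, t in zip(df['id'], df['year'], df['type'])
]
print(df)
```

This also avoids the helper columns and keeps new_value as integers, since no missing values are introduced along the way.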
cph_sto