Condition if a variable value is the same in different years, Python/Pandas. Fastest solution?

Question

I have a large dataset (20 million rows). The dataset contains information on where a person lived in 2018 and 2019. I wish to write a condition that returns True if the variable 'county' has the same value in both 2018 and 2019, and False if the two values differ. What is the most efficient way to achieve this?

import pandas as pd

df = pd.DataFrame({'id': [10, 10, 20, 20, 30, 30, 40, 40], 'year': [2018, 2019, 2018, 2019, 2018, 2019, 2018, 2019],
    'county': ['1', '1', '4', '2', '3', '3', '1', '3']})

I aim to create a new column that is True for id 10 (stayer) and False for id 20 (mover).

Is it possible to test the performance of both solutions on real data? — jezrael, Jul 08 '21 at 09:22
@jezrael The set_index method: 38.3 s ± 532 ms per loop; the g.transform method: 42.2 s ± 1.63 s per loop. And the lambda method: still no result. — Henri, Jul 08 '21 at 11:00
If I scale the dataset down to 8000 rows, the results are pretty clear. The lambda method measures in at 3.52 s compared to 9 ms for the set_index method. I hadn't realised how large the performance differences were. Thanks a lot. — Henri, Jul 08 '21 at 11:09

jezrael · Accepted Answer · 2021-07-08T09:27:06.540

For a more efficient solution, don't use a lambda function. Comparing the first and last values per group should be faster, like:

# Group the county values by person id
g = df.groupby('id')['county']
# True where the first and last county within each id match
df['newcol'] = g.transform('first').eq(g.transform('last'))
print(df)
   id  year county  newcol
0  10  2018      1    True
1  10  2019      1    True
2  20  2018      4   False
3  20  2019      2   False
4  30  2018      3    True
5  30  2019      3    True
6  40  2018      1   False
7  40  2019      3   False
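
The first/last comparison only answers the question if every id has exactly two rows, one per year. With a single row, first and last are the same value and the result is trivially True. A minimal sketch of a guarded variant follows, assuming that ids without both rows should count as False; that rule and the `rows_per_id` name are illustrative assumptions, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [10, 10, 20, 20, 30, 30, 40, 40],
    'year': [2018, 2019, 2018, 2019, 2018, 2019, 2018, 2019],
    'county': ['1', '1', '4', '2', '3', '3', '1', '3'],
})

g = df.groupby('id')['county']

# Number of rows per id, broadcast back to every row (assumed check)
rows_per_id = g.transform('size')

# Same county in the first and last row of each id
same_county = g.transform('first').eq(g.transform('last'))

# Assumption: an id counts as a stayer only if it has both rows
df['newcol'] = same_county & rows_per_id.eq(2)
print(df)
```

On the sample data every id has two rows, so the result matches the output above.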

Another solution, without groupby, should be more efficient:

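# Index county by (id, year) so each year can be selected as a cross-section
s = df.set_index(['id', 'year'])['county']

# Compare the 2018 and 2019 values per id, then map the result back onto each row
df['newcol'] = df['id'].map(s.xs(2018, level=1).eq(s.xs(2019, level=1)))
print(df)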
   id  year county  newcol
0  10  2018      1    True
1  10  2019      1    True
2  20  2018      4   False
3  20  2019      2   False
4  30  2018      3    True
5  30  2019      3    True
6  40  2018      1   False
7  40  2019      3   False
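
To see which of the two approaches is faster on data shaped like yours, a rough timing sketch on synthetic data might look like the following. The data size, the number of distinct counties, and the helper names `first_last` and `cross_section` are assumptions made for illustration; real timings depend on your data and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data: n_ids people, one row per year each,
# built directly from arrays rather than by concatenating small frames
n_ids = 100_000
rng = np.random.default_rng(0)
big = pd.DataFrame({
    'id': np.repeat(np.arange(n_ids), 2),
    'year': np.tile([2018, 2019], n_ids),
    'county': rng.integers(1, 22, size=2 * n_ids).astype(str),
})


def first_last(df):
    """groupby approach: compare first and last county per id."""
    g = df.groupby('id')['county']
    return g.transform('first').eq(g.transform('last'))


def cross_section(df):
    """set_index approach: compare the 2018 and 2019 cross-sections."""
    s = df.set_index(['id', 'year'])['county']
    return df['id'].map(s.xs(2018, level=1).eq(s.xs(2019, level=1)))


# Both approaches should agree before their timings are compared
agree = np.array_equal(first_last(big).to_numpy(), cross_section(big).to_numpy())
assert agree

for func in (first_last, cross_section):
    # Best of three single runs, in seconds
    seconds = min(timeit.repeat(lambda: func(big), number=1, repeat=3))
    print(f'{func.__name__}: {seconds:.3f} s')
```

Raising `n_ids` moves the test closer to the 20 million row case, at the cost of memory and run time.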

lol no sir, I tested on 55k rows, so I don't know about 20 million rows. Sorry for the above comment **:)** — Anurag Dabas, Jul 08 '21 at 09:50
@AnuragDabas - No, I am not sure, so I asked about the method. If you don't use `concat` to create the huge DataFrame, then it is a good method for testing. — jezrael, Jul 08 '21 at 09:51
