1

I have a large dataset (20 millions rows). The dataset contains information on where a person live year 2018 and 2019. I wish to write a condition that returns True if the variable 'county" has the same value both year 2018 and 2019 and False if the two values differ. what is the most effective way to acheive this?

df=pd.DataFrame({'id': [10, 10, 20, 20, 30, 30, 40, 40], 'year': [2018, 2019, 2018, 2019, 2018, 2019, 2018, 2019],
    'county' : ['1', '1', '4', '2', '3', '3', '1', '3']})

I aim to create a new column that for id 10 is True (stayer) and for id 20 is False (mover)

Henri
  • 1,077
  • 10
  • 24

1 Answers1

1

For more efective solution dont use lambda function, faster should be compare first and last values per groups like:

g = df.groupby(['id'])['county']
df['newcol'] = g.transform('first').eq(g.transform('last'))
print (df)
   id  year county  newcol
0  10  2018      1    True
1  10  2019      1    True
2  20  2018      4   False
3  20  2019      2   False
4  30  2018      3    True
5  30  2019      3    True
6  40  2018      1   False
7  40  2019      3   False

Another not groupby solution should be more effective:

s = df.set_index(['id','year'])['county']

df['newcol'] = df['id'].map(s.xs(2018, level=1).eq(s.xs(2019, level=1)))
print (df)
   id  year county  newcol
0  10  2018      1    True
1  10  2019      1    True
2  20  2018      4   False
3  20  2019      2   False
4  30  2018      3    True
5  30  2019      3    True
6  40  2018      1   False
7  40  2019      3   False
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • lol no sir I tested on 55k rows..so I don't know about 20 million data.....sorry for the above comment **:)** – Anurag Dabas Jul 08 '21 at 09:50
  • 1
    @AnuragDabas - No, I am not sure, so ask about method. If dont use `concat` for create huge DataFrame then it is good method for test. – jezrael Jul 08 '21 at 09:51