1

I have a pretty large dataframe (20M rows) containing workers. Some of the workers appear multiple times, given they might have previously worked in other firms. For each worker, I have information about first, last name (columns: first_name, last_name) and the starting date of each job (started_working_date). I'd like to create a new column in the dataframe (past_experience) that flags workers that were previously employed (in any firm).

In order to do so, I have to check whether, for each worker in my dataframe, there is a row (in the same dataframe) with the same first and last name and a earlier starting date than the one considered.

The dataframe looks like this:

   first_name  last_name started_working_date
0  Bob         J         1995-01-01
1  John        S         2000-01-01
1  Mark        L         2001-01-01
1  Bob         J         1997-01-01

The ideal result should be:

   first_name  last_name  started_working_date  past_experience
0  Bob         J          1995-01-01            0
1  John        S          2000-01-01            0
1  Mark        L          2001-01-01            0
1  Bob         J          1997-01-01            1

I tried to write a very simple apply function that filters the dataframe according to my conditions and returns 0 or 1 depending on the length of the resulting dataframe.

def past_experience(row):
    filtered_df = my_df[(my_df['first_name'] == row['first_name']) & (my_df['last_name'] == row['last_name']) & (my_df['started_working_date'] < row['started_working_date'])]
    if filtered_df.shape[0]>0:
        return 1
    else:
        return 0

my_df['past_experience'] = my_df.apply(past_experience, axis=1)

This works, but it is extremely inefficient. Could you suggest a better solution?

user2447387
  • 173
  • 1
  • 3
  • 12
  • can you add a dummy data(small one) and describe your needs and the expected output please? Thanks. For help refer this: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – anky Apr 18 '19 at 15:02
  • 1
    I've added it, sorry about that – user2447387 Apr 18 '19 at 15:11
  • so what about `df.duplicated(['first_name','last_name']).astype(int)` ? – anky Apr 18 '19 at 15:28

2 Answers2

2

I am using numpy broadcast , notice this method still a o(n*n) check , which mean if you data frame is big, memory will exceed.

s1=my_df['first_name'].values
s2=my_df['last_name'].values
s3=my_df['started_working_date'].values

np.any((s1==s1[:,None])&(s2==s2[:,None])&(s3<s3[:,None]),1)
BENY
  • 317,841
  • 20
  • 164
  • 234
1

Try:

groups = df.groupby(['first_name','last_name'])
df['employed'] = groups.started_working_date.cumcount()

df['employed_shift'] = groups.employed.shift().fillna(0)

df['employed_changed'] = (df['employed'] != df['employed_shift'])

And output:

+---+------------+-----------+----------------------+----------+----------------+
|   | first_name | last_name | started_working_date | employed | employ_changed |
+---+------------+-----------+----------------------+----------+----------------+
| 0 | Bob        | J         | 1995-01-01           |        0 | False          |
| 1 | John       | S         | 2000-01-01           |        0 | False          |
| 1 | Mark       | L         | 2001-01-01           |        0 | False          |
| 1 | Bob        | J         | 1997-01-01           |        1 | True           |
+---+------------+-----------+----------------------+----------+----------------+
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74