id | dob | status | sex | work_code | work_cat | gross_income | dependent_cat | dependent_dob | preparer | city | new_id |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1981 | Married | MALE | 1111 | Government | 15000 | U | 2001 | Taxpayer | BOSTON | 1 |
2 | 1981 | Married | MALE | 1111 | Government | 15000 | U | 2002 | Taxpayer | BOSTON | 1 |
3 | 1983 | SINGLE | MALE | 2222 | Federal | 20000 | N | 2009 | Taxpayer | MIAMI | 2 |
4 | 1983 | SINGLE | MALE | 2222 | Federal | 20000 | N | 2010 | Taxpayer | MIAMI | 2 |
5 | 1975 | SINGLE | MALE | 2222 | Federal | 12000 | N/A | - | Taxpayer | NYC | 3 |
6 | 1983 | Married | MALE | 2222 | Federal | 18000 | U | 2013 | Taxpayer | NYC | 4 |
7 | 1981 | Married | MALE | 1111 | Government | 12000 | N | 2005 | Taxpayer | PORLAND | 5 |
8 | 1981 | Married | MALE | 1111 | Government | 15000 | N/A | - | Taxpayer | BOSTON | 6 |
About the data
I have a .txt dataset with 1.6m rows that looks like the table above, minus the new_id column, which is what I want to generate. Each row is a tax filer and one of their dependents (unless dependent_cat is N/A). When a row is followed by one that is identical in every column except the two dependent_* columns, it is assumed to be the same household with a different dependent, so both rows should get the same new_id (e.g. new_id = 1 for rows 1 and 2). Later I can collapse the dataset so that each row is a household, with a column that counts the dependents and so on.
What I've done
I have in fact generated household IDs in the real dataset, but I did not manage to incorporate the 'row proximity'. It turns out it's quite important: I got too many cases like rows #1, #2 and #8 in the example being bunched into the same household when they shouldn't be, and as a result too many 0-income households with counts of >10 dependents.
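For context, a proximity-unaware grouping (not my actual code; column names assumed from the table) reproduces exactly this bunching on a toy frame:

```python
import pandas as pd

# Toy frame reproducing the clash: three rows mirroring #1, #2 and #8
# above, which share identical key columns.
df = pd.DataFrame({
    "dob": [1981, 1981, 1981],
    "status": ["Married", "Married", "Married"],
    "gross_income": [15000, 15000, 15000],
    "dependent_cat": ["U", "U", "N/A"],
})

# groupby/ngroup numbers the groups, but it ignores row order entirely,
# so the N/A filer (the row mirroring #8) is lumped into household 1.
df["bad_id"] = df.groupby(["dob", "status", "gross_income"]).ngroup() + 1
# -> bad_id == [1, 1, 1], though the last row should be its own household
```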
What I propose
I've tried to think this through, and I came up with:
- loop through each row;
- assign a new ID if dependent_cat is N/A;
- otherwise, check whether the previous row is identical in all columns (except dependent_cat and dependent_dob):
  - if so, copy its new_id;
  - otherwise, new_id = last assigned ID + 1.
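On a toy frame (column names assumed from the table, plus an N/A filer tacked on), the steps above would look roughly like:

```python
import pandas as pd

# Toy frame mirroring rows 1-3 of the table, plus a dependent-less filer.
df = pd.DataFrame({
    "dob": [1981, 1981, 1983, 1983],
    "status": ["Married", "Married", "SINGLE", "SINGLE"],
    "gross_income": [15000, 15000, 20000, 20000],
    "dependent_cat": ["U", "U", "N", "N/A"],
    "dependent_dob": [2001, 2002, 2009, "-"],
})

# Columns that define a household: everything except the dependent_*
# columns (id would be excluded too, since it differs on every row).
key_cols = [c for c in df.columns
            if c not in ("dependent_cat", "dependent_dob")]

new_ids, last_id, prev_key = [], 0, None
for row in df.itertuples(index=False):
    key = tuple(getattr(row, c) for c in key_cols)
    # N/A filers always open a fresh household; otherwise a row only
    # joins the previous household if the key columns match exactly.
    if row.dependent_cat == "N/A" or key != prev_key:
        last_id += 1
    new_ids.append(last_id)
    prev_key = key
df["new_id"] = new_ids  # -> [1, 1, 2, 3]
```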
Why my proposal sucks - looking for alternatives
However, I know that looping through 1.6m dataframe rows is not very pythonic. I might try it, but relying on it would be continuing my lazy approach to Python.
Is there any way that hashing could help me here? The hashing approaches I've found would give id #1 and id #8 the same new_id, which leaves me exactly where I am now. I'd need the hash to 'reset' once it passes id #2, so that id #8 doesn't get the same new_id. I would also prefer, though not require, new_id to be consecutive integers starting at 1 as shown (such that the maximum new_id, in the end, is the number of households in my dataset).
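One loop-free alternative I can imagine (a sketch on a toy frame, not tested on the real data; column names assumed) is to compare each row to its shifted neighbour and cumulative-sum the "new household starts here" flags, which also yields consecutive IDs:

```python
import pandas as pd

# Same toy frame as before: rows 1-3 of the table plus an N/A filer.
df = pd.DataFrame({
    "dob": [1981, 1981, 1983, 1983],
    "status": ["Married", "Married", "SINGLE", "SINGLE"],
    "gross_income": [15000, 15000, 20000, 20000],
    "dependent_cat": ["U", "U", "N", "N/A"],
})
key_cols = ["dob", "status", "gross_income"]

# A row opens a new household when its key columns differ from the
# previous row's, or when it has no dependent at all. The shift makes
# the comparison strictly row-order-aware, so a later identical row
# (like #8) does NOT rejoin an earlier household.
starts = (df[key_cols].ne(df[key_cols].shift()).any(axis=1)
          | df["dependent_cat"].eq("N/A"))
# cumsum turns the boolean flags into consecutive integers starting at
# 1, so max(new_id) is the household count.
df["new_id"] = starts.cumsum()  # -> [1, 1, 2, 3]
```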
Related question: Pandas - Generate Unique ID based on row values