id | dob | status | sex | work_code | work_cat | gross_income | dependent_cat | dependent_dob | preparer | city | new_id |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1981 | Married | MALE | 1111 | Government | 15000 | U | 2001 | Taxpayer | BOSTON | 1 |
2 | 1981 | Married | MALE | 1111 | Government | 15000 | U | 2002 | Taxpayer | BOSTON | 1 |
3 | 1983 | SINGLE | MALE | 2222 | Federal | 20000 | N | 2009 | Taxpayer | MIAMI | 2 |
4 | 1983 | SINGLE | MALE | 2222 | Federal | 20000 | N | 2010 | Taxpayer | MIAMI | 2 |
5 | 1975 | SINGLE | MALE | 2222 | Federal | 12000 | N/A | - | Taxpayer | NYC | 3 |
6 | 1983 | Married | MALE | 2222 | Federal | 18000 | U | 2013 | Taxpayer | NYC | 4 |
7 | 1981 | Married | MALE | 1111 | Government | 12000 | N | 2005 | Taxpayer | PORLAND | 5 |
8 | 1981 | Married | MALE | 1111 | Government | 15000 | N/A | - | Taxpayer | BOSTON | 6 |
About the data
I have a .txt dataset with 1.6m rows that looks like the table above, minus the new_id column, which is what I want to generate. Each row is a tax filer and one of their dependents (unless dependent_cat is N/A). When a row is followed by one that is identical in every column except the two dependent_* columns, it is assumed to be the same household with a different dependent, so both rows should get the same new_id (e.g. new_id = 1 for rows 1 and 2). Later I can collapse the dataset so that each row is a household, with a column that counts the dependents and so on.
What I've done
I have in fact generated household IDs in the real dataset, but I did not manage to incorporate the 'row proximity'. It turns out it's quite important: I got too many cases like rows #1, #2 and #8 in the example being bunched into the same household when they shouldn't be, and as a result too many 0-income households with counts of >10 dependents.
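For context, a proximity-unaware grouping (not my actual code; column names assumed from the table) reproduces exactly this bunching on a toy frame:

```python
import pandas as pd

# Toy frame reproducing the clash: three rows mirroring #1, #2 and #8
# above, which share identical key columns.
df = pd.DataFrame({
    "dob": [1981, 1981, 1981],
    "status": ["Married", "Married", "Married"],
    "gross_income": [15000, 15000, 15000],
    "dependent_cat": ["U", "U", "N/A"],
})

# groupby/ngroup numbers the groups, but it ignores row order entirely,
# so the N/A filer (the row mirroring #8) is lumped into household 1.
df["bad_id"] = df.groupby(["dob", "status", "gross_income"]).ngroup() + 1
# -> bad_id == [1, 1, 1], though the last row should be its own household
```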
What I propose
I've tried to think this through, and I came up with:
- loop through each row;
- assign a new ID if dependent_cat is N/A;
- otherwise, check whether the previous row is identical in all columns (except dependent_cat and dependent_dob):
  - if so, copy its new_id;
  - otherwise, new_id = last assigned ID + 1.
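On a toy frame (column names assumed from the table, plus an N/A filer tacked on), the steps above would look roughly like:

```python
import pandas as pd

# Toy frame mirroring rows 1-3 of the table, plus a dependent-less filer.
df = pd.DataFrame({
    "dob": [1981, 1981, 1983, 1983],
    "status": ["Married", "Married", "SINGLE", "SINGLE"],
    "gross_income": [15000, 15000, 20000, 20000],
    "dependent_cat": ["U", "U", "N", "N/A"],
    "dependent_dob": [2001, 2002, 2009, "-"],
})

# Columns that define a household: everything except the dependent_*
# columns (id would be excluded too, since it differs on every row).
key_cols = [c for c in df.columns
            if c not in ("dependent_cat", "dependent_dob")]

new_ids, last_id, prev_key = [], 0, None
for row in df.itertuples(index=False):
    key = tuple(getattr(row, c) for c in key_cols)
    # N/A filers always open a fresh household; otherwise a row only
    # joins the previous household if the key columns match exactly.
    if row.dependent_cat == "N/A" or key != prev_key:
        last_id += 1
    new_ids.append(last_id)
    prev_key = key
df["new_id"] = new_ids  # -> [1, 1, 2, 3]
```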
Why my proposal sucks - looking for alternatives
However, I know that looping through 1.6m dataframe rows is not very pythonic. I might try it, but relying on it would be continuing my lazy approach to Python.
Is there any way that hashing could help me here? The hashing approaches I've found would give id #1 and id #8 the same new_id, which leaves me exactly where I am now. I'd need the hash to 'reset' once it passes id #2, so that id #8 doesn't get the same new_id. I would also prefer, though not require, new_id to be consecutive integers starting at 1 as shown (such that the maximum new_id, in the end, is the number of households in my dataset).
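One loop-free alternative I can imagine (a sketch on a toy frame, not tested on the real data; column names assumed) is to compare each row to its shifted neighbour and cumulative-sum the "new household starts here" flags, which also yields consecutive IDs:

```python
import pandas as pd

# Same toy frame as before: rows 1-3 of the table plus an N/A filer.
df = pd.DataFrame({
    "dob": [1981, 1981, 1983, 1983],
    "status": ["Married", "Married", "SINGLE", "SINGLE"],
    "gross_income": [15000, 15000, 20000, 20000],
    "dependent_cat": ["U", "U", "N", "N/A"],
})
key_cols = ["dob", "status", "gross_income"]

# A row opens a new household when its key columns differ from the
# previous row's, or when it has no dependent at all. The shift makes
# the comparison strictly row-order-aware, so a later identical row
# (like #8) does NOT rejoin an earlier household.
starts = (df[key_cols].ne(df[key_cols].shift()).any(axis=1)
          | df["dependent_cat"].eq("N/A"))
# cumsum turns the boolean flags into consecutive integers starting at
# 1, so max(new_id) is the household count.
df["new_id"] = starts.cumsum()  # -> [1, 1, 2, 3]
```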
Related question: Pandas - Generate Unique ID based on row values