I'm using Pandas to populate 6 new variables with values that are conditional on other variables in the data. The entire dataset consists of about 700,000 rows and 14 variables (columns), including my newly added ones.
My first approach was to use itertuples(), mainly because I have little experience with Pandas. This took around 9600 seconds.
I've managed to make this faster (~3500 seconds) by using apply(). Here is an example for one of the new variables.
housing_df = utils.make_data_frame("data/source_data/housing_with_child.dta", "stata")
""" Populate new variables """
# setup new variables
housing_df["hh_n"] = ""
housing_df["bio_d"] = ""
housing_df["step_d"] = ""
housing_df["child_d"] = ""
housing_df["both_bio"] = ""
housing_df["hhtype"] = ""
housing_df["step"] = ""
# hh_n
def hh_n(row):
    df = housing_df.loc[
        (housing_df["pidp"] == row["pidp"]) & (housing_df["wave"] == row["wave"])
    ]
    return str(len(df.index) + 1)

housing_df["hh_n"] = housing_df.apply(hh_n, axis=1)
Each new variable follows the same pattern and needs to do the following:
For each row:
Get some of the row's existing data (e.g. pidp and wave)
Find the rows in the whole data frame that have the same values (pidp and wave) and put these in a new dataframe
Count the rows found in the new dataframe
Return the count value for the new variable (hh_n)
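The steps above amount to counting rows per (pidp, wave) group, so they could in principle be expressed without a per-row lookup. Below is a minimal sketch of that idea, not a tested replacement for the real data: toy_df and add_group_count are hypothetical names with made-up values, the + 1 simply mirrors the apply() version, and it assumes the key columns contain no missing values.

```python
import pandas as pd

# Toy frame using the question's column names; the values are made up for illustration.
toy_df = pd.DataFrame(
    {
        "pidp": [1367, 2727, 3407, 3407, 3407, 3407],
        "wave": [2.0, 2.0, 2.0, 2.0, 3.0, 3.0],
    }
)


def add_group_count(df, keys, new_col):
    """Return a copy of df where new_col = (rows sharing the same keys) + 1, as a string."""
    out = df.copy()
    # transform("size") broadcasts each group's row count back onto every row of that group
    counts = out.groupby(keys)[keys[0]].transform("size")
    # + 1 and the string conversion mirror the hh_n() function in the question
    out[new_col] = (counts + 1).astype(str)
    return out


result = add_group_count(toy_df, ["pidp", "wave"], "hh_n")
print(result)
```

The grouping is done once for the whole frame rather than once per row, which is where any speed-up would come from. Whether the other five variables fit the same shape depends on their conditions.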
Here is a small example of the output data:
In case it's relevant, here is my method for creating a dataframe from the stata file in the first line:
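import pandas as pd

# Method that returns a data frame from a Stata file or CSV
def make_data_frame(data, type):
    # Compare strings with ==; `is` tests identity and is not reliable for strings
    if type == "stata":
        reader = pd.read_stata(data, chunksize=100000)
    elif type == "csv":
        reader = pd.read_csv(data, chunksize=100000)
    else:
        raise ValueError(f"Unsupported type: {type}")
    # Concatenate the chunks in one call; DataFrame.append was removed in pandas 2.0
    return pd.concat(list(reader))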
Here is an example of the first 100 rows of data:
{'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31, 32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47, 48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63, 64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79, 80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95, 96: 96, 97: 97, 98: 98, 99: 99}, 'pidp': {0: 1367, 1: 2051, 2: 2727, 3: 2727, 4: 2727, 5: 2727, 6: 2727, 7: 3407, 8: 3407, 9: 3407, 10: 3407, 11: 3407, 12: 3407, 13: 3407, 14: 3407, 15: 3407, 16: 3407, 17: 3407, 18: 3407, 19: 3407, 20: 3407, 21: 3407, 22: 3407, 23: 3407, 24: 3407, 25: 3407, 26: 3407, 27: 3407, 28: 3407, 29: 4091, 30: 4091, 31: 4091, 32: 4091, 33: 4091, 34: 4091, 35: 4091, 36: 4091, 37: 4767, 38: 4767, 39: 4767, 40: 4767, 41: 4767, 42: 4767, 43: 5451, 44: 5451, 45: 5451, 46: 5451, 47: 5451, 48: 5451, 49: 6135, 50: 6135, 51: 6135, 52: 6135, 53: 6135, 54: 6807, 55: 6807, 56: 6844, 57: 6844, 58: 6844, 59: 6844, 60: 6844, 61: 6844, 62: 6844, 63: 6844, 64: 6844, 65: 6844, 66: 6844, 67: 6844, 68: 6844, 69: 6844, 70: 6844, 71: 6844, 72: 6844, 73: 6844, 74: 6844, 75: 6844, 76: 6844, 77: 6844, 78: 6844, 79: 6844, 80: 6844, 81: 6844, 82: 6844, 83: 6844, 84: 6844, 85: 6844, 86: 6844, 87: 6844, 88: 6844, 89: 6844, 90: 6844, 91: 6844, 92: 6844, 93: 6844, 94: 6844, 95: 6844, 96: 6844, 97: 6844, 98: 6844, 99: 6844}, 'hidp': {0: 20402, 1: 20402, 2: 5799043, 3: 60159608, 4: 45594010, 5: 53475212, 6: 15449616, 7: 102002, 8: 102002, 9: 30559202, 10: 30559202, 11: 6351204, 12: 6351204, 13: 2590808, 14: 13610, 15: 6812, 16: 
13614, 17: 20416, 18: 21902818, 19: 36407220, 20: 36407220, 21: 20424, 22: 20424, 23: 20426, 24: 20426, 25: 9010028, 26: 9010028, 27: 18856430, 28: 18856430, 29: 136002, 30: 136002, 31: 30566002, 32: 30566002, 33: 6358004, 34: 6358004, 35: 20406, 36: 20406, 37: 210802, 38: 210802, 39: 30579602, 40: 30579602, 41: 30579602, 42: 6371604, 43: 210802, 44: 210802, 45: 30579602, 46: 30579602, 47: 30579602, 48: 6371604, 49: 210802, 50: 210802, 51: 30579602, 52: 30579602, 53: 30579602, 54: 285602, 55: 30593202, 56: 37073606, 57: 37073606, 58: 37073606, 59: 37073606, 60: 37073606, 61: 12580008, 62: 12580008, 63: 12580008, 64: 12580008, 65: 12580008, 66: 56052408, 67: 56052408, 68: 56052408, 69: 56052408, 70: 56052408, 71: 38848410, 72: 38848410, 73: 38848410, 74: 38848410, 75: 38848410, 76: 50014012, 77: 50014012, 78: 50014012, 79: 5467216, 80: 5467216, 81: 5467216, 82: 25547618, 83: 25547618, 84: 25547618, 85: 38610420, 86: 38610420, 87: 38610420, 88: 5025224, 89: 5025224, 90: 5025224, 91: 4637626, 92: 4637626, 93: 4637626, 94: 12784028, 95: 12784028, 96: 12784028, 97: 21719230, 98: 21719230, 99: 21719230}, 'apidp': {0: 2051, 1: 1367, 2: 752052125, 3: 752052125, 4: 752052125, 5: 752052125, 6: 752052125, 7: 545740805, 8: 612001365, 9: 545740805, 10: 612001365, 11: 545740805, 12: 612001365, 13: 612001365, 14: 612001365, 15: 612001365, 16: 612001365, 17: 612001365, 18: 612001365, 19: 612001365, 20: 612001369, 21: 612001365, 22: 612001369, 23: 612001365, 24: 612001369, 25: 612001365, 26: 612001369, 27: 612001365, 28: 612001369, 29: 68002045, 30: 68002049, 31: 68002045, 32: 68002049, 33: 68002045, 34: 68002049, 35: 68002045, 36: 68002049, 37: 5451, 38: 6135, 39: 5451, 40: 6135, 41: 5298579, 42: 5451, 43: 4767, 44: 6135, 45: 4767, 46: 6135, 47: 5298579, 48: 4767, 49: 4767, 50: 5451, 51: 4767, 52: 5451, 53: 5298579, 54: 4885131, 55: 4885131, 56: 13644, 57: 748469885, 58: 748469889, 59: 749992405, 60: 749992409, 61: 13644, 62: 748469885, 63: 748469889, 64: 749992405, 65: 749992409, 
66: 13644, 67: 748469885, 68: 748469889, 69: 749992405, 70: 749992409, 71: 13644, 72: 748469885, 73: 748469889, 74: 749992405, 75: 749992409, 76: 13644, 77: 748469885, 78: 748469889, 79: 13644, 80: 748469885, 81: 748469889, 82: 13644, 83: 748469885, 84: 748469889, 85: 13644, 86: 748469885, 87: 748469889, 88: 13644, 89: 748469885, 90: 748469889, 91: 13644, 92: 748469885, 93: 748469889, 94: 13644, 95: 748469885, 96: 748469889, 97: 13644, 98: 748469885, 99: 748469889}, 'wave': {0: 2.0, 1: 2.0, 2: 2.0, 3: 8.0, 4: 9.0, 5: 10.0, 6: 11.0, 7: 2.0, 8: 2.0, 9: 3.0, 10: 3.0, 11: 4.0, 12: 4.0, 13: 7.0, 14: 8.0, 15: 9.0, 16: 10.0, 17: 11.0, 18: 12.0, 19: 13.0, 20: 13.0, 21: 14.0, 22: 14.0, 23: 15.0, 24: 15.0, 25: 16.0, 26: 16.0, 27: 17.0, 28: 17.0, 29: 2.0, 30: 2.0, 31: 3.0, 32: 3.0, 33: 4.0, 34: 4.0, 35: 5.0, 36: 5.0, 37: 2.0, 38: 2.0, 39: 3.0, 40: 3.0, 41: 3.0, 42: 4.0, 43: 2.0, 44: 2.0, 45: 3.0, 46: 3.0, 47: 3.0, 48: 4.0, 49: 2.0, 50: 2.0, 51: 3.0, 52: 3.0, 53: 3.0, 54: 2.0, 55: 3.0, 56: 6.0, 57: 6.0, 58: 6.0, 59: 6.0, 60: 6.0, 61: 7.0, 62: 7.0, 63: 7.0, 64: 7.0, 65: 7.0, 66: 8.0, 67: 8.0, 68: 8.0, 69: 8.0, 70: 8.0, 71: 9.0, 72: 9.0, 73: 9.0, 74: 9.0, 75: 9.0, 76: 10.0, 77: 10.0, 78: 10.0, 79: 11.0, 80: 11.0, 81: 11.0, 82: 12.0, 83: 12.0, 84: 12.0, 85: 13.0, 86: 13.0, 87: 13.0, 88: 14.0, 89: 14.0, 90: 14.0, 91: 15.0, 92: 15.0, 93: 15.0, 94: 16.0, 95: 16.0, 96: 16.0, 97: 17.0, 98: 17.0, 99: 17.0}, 'rel': {0: 'unrelated sharer', 1: 'unrelated sharer', 2: 'natural child', 3: 'natural child', 4: 'natural child', 5: 'natural child', 6: 'natural child', 7: 'lawful spouse', 8: 'natural child', 9: 'lawful spouse', 10: 'natural child', 11: 'lawful spouse', 12: 'natural child', 13: 'natural child', 14: 'natural child', 15: 'natural child', 16: 'natural child', 17: 'natural child', 18: 'natural child', 19: 'natural child', 20: 'daughter/son-in-law', 21: 'natural child', 22: 'daughter/son-in-law', 23: 'natural child', 24: 'daughter/son-in-law', 25: 'natural child', 26: 
'daughter/son-in-law', 27: 'natural child', 28: 'daughter/son-in-law', 29: 'lawful spouse', 30: 'natural child', 31: 'lawful spouse', 32: 'natural child', 33: 'lawful spouse', 34: 'natural child', 35: 'lawful spouse', 36: 'natural child', 37: 'unrelated sharer', 38: 'natural child', 39: 'other', 40: 'natural child', 41: 'daughter/son-in-law', 42: 'other', 43: 'unrelated sharer', 44: 'natural child', 45: 'other', 46: 'natural child', 47: 'daughter/son-in-law', 48: 'other', 49: 'natural parent', 50: 'natural parent', 51: 'natural parent', 52: 'natural parent', 53: 'lawful spouse', 54: 'natural brother/sister', 55: 'natural brother/sister', 56: 'natural child', 57: 'lawful spouse', 58: 'natural child', 59: 'mother/father-in-law', 60: 'mother/father-in-law', 61: 'natural child', 62: 'lawful spouse', 63: 'natural child', 64: 'mother/father-in-law', 65: 'mother/father-in-law', 66: 'natural child', 67: 'lawful spouse', 68: 'natural child', 69: 'mother/father-in-law', 70: 'mother/father-in-law', 71: 'natural child', 72: 'lawful spouse', 73: 'natural child', 74: 'mother/father-in-law', 75: 'mother/father-in-law', 76: 'natural child', 77: 'lawful spouse', 78: 'natural child', 79: 'natural child', 80: 'lawful spouse', 81: 'natural child', 82: 'natural child', 83: 'lawful spouse', 84: 'natural child', 85: 'natural child', 86: 'lawful spouse', 87: 'natural child', 88: 'natural child', 89: 'lawful spouse', 90: 'natural child', 91: 'natural child', 92: 'lawful spouse', 93: 'natural child', 94: 'natural child', 95: 'lawful spouse', 96: 'natural child', 97: 'natural child', 98: 'lawful spouse', 99: 'natural child'}, 'is_child': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '1', 9: '', 10: '1', 11: '', 12: '1', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '1', 31: '', 32: '1', 33: '', 34: '1', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: 
'', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '1', 57: '', 58: '1', 59: '', 60: '', 61: '1', 62: '', 63: '1', 64: '', 65: '', 66: '1', 67: '', 68: '1', 69: '', 70: '', 71: '1', 72: '', 73: '1', 74: '', 75: '', 76: '1', 77: '', 78: '1', 79: '1', 80: '', 81: '1', 82: '1', 83: '', 84: '1', 85: '1', 86: '', 87: '1', 88: '1', 89: '', 90: '1', 91: '1', 92: '', 93: '1', 94: '1', 95: '', 96: '1', 97: '1', 98: '', 99: '1'}, 'hh_n': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'bio_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 
92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'step_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'child_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'both_bio': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: 
'', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'hhtype': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'step': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 
90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}}