I have 6 pandas dataframes (Patients, Test1, Test2, Test3, Test4, Test5) linked by an ID key.
Each row in the Patients dataframe represents a patient containing a unique ID there are 200000+ patients/rows.
Each row in the Test dataframes represents a test result on a day. The columns for the Test dataframes are ID, DATE, TEST_UNIT, TEST_RESULT. Each of the Test dataframes contains between 6,000,000 to 7,000,000 rows.
I want to loop through all the IDs in the Patients dataframe and in each iteration use the ID to extract relevant test data from each of the 5 Test dataframes and do some processing on them.
If I do
for i in range(len(Patients)):
ind_id = Patients.ID.iloc[i]
ind_test1 = Test1[Test1['ID'] == ind_id]
ind_test2 = Test2[Test2['ID'] == ind_id]
ind_test3 = Test3[Test3['ID'] == ind_id]
ind_test4 = Test4[Test4['ID'] == ind_id]
ind_test3 = Test5[Test5['ID'] == ind_id]
It takes about 3.6 seconds per iteration.
When I tried to speed it up by using the Numpy interface.
Patients_v = Patients.values
Test1_v = Test1.values
Test2_v = Test2.values
Test3_v = Test3.values
Test4_v = Test4.values
Test5_v = Test5.values
for i in range(len(Patients_v)):
ind_id = Patients_v[i, ID_idx]
ind_test1 = Test1_v[Test1_v[:, 0] == ind_id]
ind_test2 = Test2_v[Test2_v[:, 0] == ind_id]
ind_test3 = Test3_v[Test3_v[:, 0] == ind_id]
ind_test4 = Test4_v[Test4_v[:, 0] == ind_id]
ind_test5 = Test5_v[Test5_v[:, 0] == ind_id]
It takes about 0.9 seconds per iteration.
How can I speed this up?
Thank you