I have a large dataset that I pulled from Data.Medicare.gov (https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6)
It's a cvs of all physicians (2.4 million rows by 41 columns, 750MB), lets call this physician_df
, however, I cannot load into memory on my computer (memory error).
I have another df loaded in memory (summary_df) and I want to join columns (NPI, Last Name, First Name) from physician_df.
Is there any way to do this without having to load the data to memory? I first attempted by using their API but I get capped out (I have about 500k rows in my final df and this will always be changing). Would storing the physician_df into a SQL database make this easier?
Here are snippets of each df (fyi, the summary_df is all fake information).
summary_df
DOS Readmit SurgeonNPI
1-1-2018 1 1184809691
2-2-2018 0 1184809691
2-5-2017 1 1093707960
physician_df
NPI PAC ID Professional Enrollment LastName FirstName
1184809691 2668563156 I20120119000086 GOLDMAN SALUJA
1184809691 4688750714 I20080416000055 NOLTE KIMBERLY
1093707960 7618879354 I20040127000771 KHANDUJA KARAMJIT
Final df:
DOS Readmit SurgeonNPI LastName FirstName
1-1-2018 1 1184809691 GOLDMAN SALUJA
2-2-2018 0 1184809691 GOLDMAN SALUJA
2-5-2017 1 1093707960 KHANDUJA KARAMJIT
If I could load the physician_df then I would use the below code..
pandas.merge(summary_df, physician_df, how='left', left_on=['SurgeonNPI'], right_on=['NPI'])