2

I have a large dataset that I pulled from Data.Medicare.gov (https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6)

It's a CSV of all physicians (2.4 million rows by 41 columns, 750 MB); let's call this physician_df. However, I cannot load it into memory on my computer (MemoryError).

I have another df loaded in memory (summary_df) and I want to join columns (NPI, Last Name, First Name) from physician_df.

Is there any way to do this without having to load the data into memory? I first attempted to use their API, but I get capped out (my final df has about 500k rows and will always be changing). Would storing physician_df in a SQL database make this easier?

Here are snippets of each df (fyi, the summary_df is all fake information).

summary_df

DOS        Readmit    SurgeonNPI   
1-1-2018   1          1184809691   
2-2-2018   0          1184809691   
2-5-2017   1          1093707960   

physician_df

NPI          PAC ID      Professional Enrollment   LastName FirstName
1184809691   2668563156  I20120119000086           GOLDMAN  SALUJA
1184809691   4688750714  I20080416000055           NOLTE    KIMBERLY
1093707960   7618879354  I20040127000771           KHANDUJA KARAMJIT

Final df:

DOS        Readmit    SurgeonNPI  LastName FirstName
1-1-2018   1          1184809691  GOLDMAN  SALUJA
2-2-2018   0          1184809691  GOLDMAN  SALUJA
2-5-2017   1          1093707960  KHANDUJA KARAMJIT

If I could load the physician_df then I would use the below code..

pandas.merge(summary_df, physician_df, how='left', left_on=['SurgeonNPI'], right_on=['NPI'])
CandleWax

3 Answers


I would try to import the data into a database and do the joins there (e.g. Postgres if you want a relational DB – there are pretty nice ORMs for it, like peewee). You could then use SQL operations to extract the subset of the data you are most interested in, export it, and process it with Pandas. Also, take a look at Ibis for working with databases directly – another project that Wes McKinney, the author of Pandas, worked on.

It would be great to use Pandas with an on-disk storage system, but as far as I know that's not an entirely solved problem yet. There's PyTables (a bit more on using PyTables with Pandas here), but it doesn't support joins in the same SQL-like way that Pandas does.
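A sketch of the database route using SQLite (in the standard library) and pandas' chunked reader – the filenames, column names, and chunk size here are assumptions based on the snippets in the question, and the tiny inline sample stands in for the real 750 MB file:

```python
import sqlite3
import pandas as pd

# Tiny stand-ins for the real files, taken from the snippets in the question.
summary_df = pd.DataFrame({
    "DOS": ["1-1-2018", "2-2-2018", "2-5-2017"],
    "Readmit": [1, 0, 1],
    "SurgeonNPI": [1184809691, 1184809691, 1093707960],
})
with open("physicians.csv", "w") as f:
    f.write("NPI,LastName,FirstName\n"
            "1184809691,GOLDMAN,SALUJA\n"
            "1184809691,NOLTE,KIMBERLY\n"
            "1093707960,KHANDUJA,KARAMJIT\n")

con = sqlite3.connect("physicians.db")
con.execute("DROP TABLE IF EXISTS physician")
con.commit()

# Stream the big CSV into SQLite in chunks so the full 750 MB file
# never has to fit in memory (the chunk size is arbitrary).
for chunk in pd.read_csv("physicians.csv", chunksize=100_000):
    chunk.to_sql("physician", con, if_exists="append", index=False)

# Push summary_df into the same database and let SQLite do the left join.
# MIN(rowid) in the subquery keeps the first row per NPI (SQLite's
# bare-column rule with min()/max() selects values from that row).
summary_df.to_sql("summary", con, if_exists="replace", index=False)
final_df = pd.read_sql(
    """SELECT s.DOS, s.Readmit, s.SurgeonNPI, p.LastName, p.FirstName
       FROM summary AS s
       LEFT JOIN (SELECT NPI, MIN(rowid), LastName, FirstName
                  FROM physician GROUP BY NPI) AS p
       ON s.SurgeonNPI = p.NPI""",
    con,
)
con.close()
```

For the real file you would also pass `usecols=['NPI', 'Last Name', 'First Name']` (whatever the actual headers are) to `read_csv`, and a `CREATE INDEX` on `NPI` would speed up the join considerably.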

metakermit

For your desired output, you only need 3 columns from physician_df. It is much more likely that 2.4 million rows of 3 columns will fit in memory than 5 columns (or, of course, all 41).

So I would first try extracting what you need from a 3-column dataset, convert to a dictionary, then use it to map required columns.

Note, to produce your desired output, it is necessary to drop duplicates (keeping the first) from physician_df, so I have included this logic.

import pandas as pd
from operator import itemgetter as iget

# usecols (not columns) restricts parsing to the three needed columns
d = pd.read_csv('physicians.csv', usecols=['NPI', 'LastName', 'FirstName'])\
      .drop_duplicates('NPI')\
      .set_index('NPI')[['LastName', 'FirstName']]\
      .to_dict(orient='index')

# {1093707960: {'FirstName': 'KARAMJIT', 'LastName': 'KHANDUJA'},
#  1184809691: {'FirstName': 'SALUJA', 'LastName': 'GOLDMAN'}}

df_summary['LastName'] = df_summary['SurgeonNPI'].map(d).map(iget('LastName'))
df_summary['FirstName'] = df_summary['SurgeonNPI'].map(d).map(iget('FirstName'))

#         DOS  Readmit  SurgeonNPI  LastName FirstName
# 0  1-1-2018        1  1184809691   GOLDMAN    SALUJA
# 1  2-2-2018        0  1184809691   GOLDMAN    SALUJA
# 2  2-5-2017        1  1093707960  KHANDUJA  KARAMJIT

If your final dataframe is too large to store in memory, then I would consider processing in chunks or moving the data into a database.
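If even the three-column read does not fit, the same first-occurrence mapping can be built incrementally with `read_csv`'s `chunksize` – a sketch, with a tiny inline file standing in for the real one and a deliberately small chunk size (you would use something like 100,000 in practice):

```python
import pandas as pd

# Stand-in for the real 750 MB file (same columns as the snippet above).
with open("physicians.csv", "w") as f:
    f.write("NPI,LastName,FirstName\n"
            "1184809691,GOLDMAN,SALUJA\n"
            "1184809691,NOLTE,KIMBERLY\n"
            "1093707960,KHANDUJA,KARAMJIT\n")

d = {}
# setdefault keeps the first occurrence of each NPI, matching
# drop_duplicates(keep='first') in the full-read version.
for chunk in pd.read_csv("physicians.csv",
                         usecols=["NPI", "LastName", "FirstName"],
                         chunksize=2):
    for row in chunk.itertuples(index=False):
        d.setdefault(row.NPI, {"LastName": row.LastName,
                               "FirstName": row.FirstName})
```

The resulting dict is used with `.map` exactly as above.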

jpp

Sampling!

import pandas as pd
import random

n = int(2.4e6)         # total data rows in the file
n_sample = int(2.4e5)  # rows to keep in the sample
filename = "https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6"
# skip a random subset of data rows; row 0 is the header, so sample from 1..n
skip = sorted(random.sample(range(1, n + 1), n - n_sample))
physician_df = pd.read_csv(filename, skiprows=skip)

Then this should work fine

summary_sample_df = summary_df[summary_df.SurgeonNPI.isin(physician_df.NPI)]
merge_sample_df = pd.merge(summary_sample_df, physician_df, how='left', left_on=['SurgeonNPI'], right_on=['NPI'])

Pickle your merge_sample_df. Sample again. Wash, rinse, repeat to desired confidence.
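The sample-and-pickle loop might look like this – a sketch with a small local stand-in file; the filenames, the number of rounds, and the toy values of `n` and `n_sample` are all hypothetical:

```python
import random
import pandas as pd

# Stand-in files for the real data (columns from the snippets above).
with open("physicians.csv", "w") as f:
    f.write("NPI,LastName,FirstName\n"
            "1184809691,GOLDMAN,SALUJA\n"
            "1093707960,KHANDUJA,KARAMJIT\n")
summary_df = pd.DataFrame({
    "DOS": ["1-1-2018", "2-2-2018", "2-5-2017"],
    "Readmit": [1, 0, 1],
    "SurgeonNPI": [1184809691, 1184809691, 1093707960],
})

n, n_sample = 2, 1  # real values: ~2.4e6 rows and your chosen sample size
for i in range(3):  # repeat to desired confidence
    skip = sorted(random.sample(range(1, n + 1), n - n_sample))
    physician_df = pd.read_csv("physicians.csv", skiprows=skip)
    sample = summary_df[summary_df.SurgeonNPI.isin(physician_df.NPI)]
    merged = pd.merge(sample, physician_df, how="left",
                      left_on="SurgeonNPI", right_on="NPI")
    merged.to_pickle(f"merge_sample_{i}.pkl")  # combine the pickles later
```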

Joshua Cook