Searching one .CSV File with the Columns of another .CSV File

Question

I'm working with two Excel files. One is a very large (~5 to 6 GB) data set. This is from the government's Open Payments Data, which is free and open to everyone to view. It is a file that lists all of the disclosed payments from Industry to physicians, and it is open to the public (which is why I took a screenshot).

The second Excel file I'm working with is also large, but it's a file that lists the disclosed payments from physicians at a particular institution.

My goals:

I'd like to 'filter' out the Open Payments Data to just include the physicians I have in my second Excel file. Is there any way to do that? The Open Payments Data is inconsistent and has some uppercase and lower case.

What I've done so far:

I've been able to parse out the Open Payments Data to just include the state of the physicians I'm looking for. I've also imported both of these .csv files into R and named them accordingly.

I'm taking a course in R right now but it's been no help ... and most of the answers I've found online are for smaller sets of data. The data I'm working with has ~500,000 rows! Thank you in advance for your insight.

Edit: This is head(mydata)

 Physician_Profile_ID Physician_First_Name
1               377519                KELLI
2               377519                KELLI
3               377519                KELLI
4               272641                ABDUL
5               272641                ABDUL
6               272641                ABDUL
  Physician_Middle_Name Physician_Last_Name
1                     A                AABY
2                     A                AABY
3                     A                AABY
4                     A               AADAM
5                     A               AADAM
6                                     AADAM
  Physician_Name_Suffix
1                      
2                      
3                      
4                      
5                      
6                      
  Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name
1                          BioHorizons Implant Systems Inc.
2                          BioHorizons Implant Systems Inc.
3                          BioHorizons Implant Systems Inc.
4                                    APOLLO ENDOSURGERY INC
5                                    APOLLO ENDOSURGERY INC
6                             BOSTON SCIENTIFIC CORPORATION
  Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name
1                              BioHorizons Implant Systems Inc.
2                              BioHorizons Implant Systems Inc.
3                              BioHorizons Implant Systems Inc.
4                                        Apollo Endosurgery Inc
5                                        APOLLO ENDOSURGERY INC
6                                 Boston Scientific Corporation
  Total_Amount_of_Payment_USDollars Date_of_Payment
1                             11.55       6/17/2014
2                            187.50        6/4/2014
3                            222.24       5/23/2014
4                             60.20        5/4/2014
5                            110.15       7/28/2014
6                             12.36      12/10/2014
  Form_of_Payment_or_Transfer_of_Value
1           In-kind items and services
2           In-kind items and services
3           In-kind items and services
4           In-kind items and services
5           In-kind items and services
6           In-kind items and services
  Nature_of_Payment_or_Transfer_of_Value City_of_Travel
1                      Food and Beverage               
2                                   Gift               
3                              Education               
4                      Food and Beverage               
5                      Food and Beverage               
6                      Food and Beverage               
  State_of_Travel Country_of_Travel
1                                  
2                                  
3                                  
4                                  
5                                  
6

And this is head(institution_data, 2):

  DB.ID Last.Name First.Name
1 12345    Johnson    John
2 12354     Twain      Mark

Names have been changed for confidentiality. DB ID != Physician_ID unfortunately.

You could just do a vlookup but half a million rows might require some processing power and possibly an hour XD Perhaps divide up the open data (as you've done with states) and do it in segments? — findwindow, Aug 10 '15 at 21:35
Build an array of unique names from your single institution list (see [this](http://stackoverflow.com/questions/31891059/extracting-the-collection-of-unique-values-from-a-filter-in-vba/31891475#31891475)) then use the [AutoFilter Method](https://msdn.microsoft.com/en-us/library/office/aa221844%28v=office.11%29.aspx) with your new array as the criteria. — , Aug 10 '15 at 21:37
It should not be difficult to do this in R. Should be pretty easy to build a vector pf physician_ID's that are in both files and then pull those. But if you are doing this in Excel you should definitely remove that [r] tag. If you are doing in R, then post `head(Open_Payments_Data)` and `head(Particular_Institution_Data)`. — IRTFM, Aug 10 '15 at 21:42
Hi everyone. I'm trying to do this in R. This file originally couldn't be opened in Excel, and to divide up the data manually using Excel would take ages with the 500,000 rows. :( @Jeeped it looks like that's using something other than R? (Please correct me if I'm mistaken). — Susie, Aug 10 '15 at 21:45
If you're trying to do this in R and you're using csvs instead of "excel files", then your title is confusing. The download itself comes in csv according to the website... — Frank, Aug 10 '15 at 21:46
You wouldn't divide up the data manually in excel but since you're not using it, might want to remove the tags... — findwindow, Aug 10 '15 at 21:53

IRTFM · Accepted Answer · 2015-08-10T22:09:49.607

A list (vector actually) of physician IDs could be constructed:

 PHY_ID <- unique( 
        institution_data$DB.ID[ institution_data$DB.ID %in% mydata$Physician_Profile_ID ] )

Then extract the data from the main file using the matches to that vector:

chargedata <- mydata[ mydata$Physician_Profile_ID %in% PHY_ID , ]

Could also use match with the same logic but the %in% function uses match "under the hood" and code written with %in% is generally easier to read. If the ID's were not supposed to match, which you should have stated if that were the case, then name matching could be attempted but it would make sense to add additional criteria, such as state of nearby zipcode.

Searching one .CSV File with the Columns of another .CSV File

1 Answers1