If I understand you correctly, you want to compare the values in column 4 and column 1 and, if they are equal, output a new column containing the value from column 3.
To do this you can use np.where as follows:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'CUI': ['C0161894', 'C0029730', 'C0176693', 'C0029730', 'C0000074'],
                    'ICD9/10': [39, 398, 398, 3989, 3989],
                    'Out': [4000001, 4000002, 4000003, 4000004, 4000005],
                    'Lookup': ['C0000005', 'C0000039', 'C0000052', 'C0000074', 'C0000074']})

# Where Lookup matches CUI, take the value from Out; otherwise flag 'No Match'
df1['Match'] = np.where(df1.Lookup == df1.CUI, df1.Out, 'No Match')
Output:
CUI ICD9/10 Lookup Out Match
0 C0161894 39 C0000005 4000001 No Match
1 C0029730 398 C0000039 4000002 No Match
2 C0176693 398 C0000052 4000003 No Match
3 C0029730 3989 C0000074 4000004 No Match
4 C0000074 3989 C0000074 4000005 4000005
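As a side note (not something the question requires): the 'No Match' string forces the Match column to a mixed object dtype. If you would rather keep that column numeric, one option is to leave non-matching rows as NaN instead, for example with Series.where:

# Continuing from the df1 defined above: NaN in place of 'No Match'
# keeps the Match column numeric (float, because NaN is a float)
df1['Match'] = df1['Out'].where(df1['Lookup'] == df1['CUI'])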
Edit:
In response to your comment, you can use the chunksize parameter of pandas.read_csv to read the file in chunks instead of loading the whole dataframe into memory at once.
For data in a csv file as follows:
CUI ICD9/10 Lookup Out
C0161894 39 C0000005 4000001
C0029730 398 C0000039 4000002
C0176693 398 C0000052 4000003
C0029730 3989 C0000074 4000004
C0000074 3989 C0000074 4000005
See https://stackoverflow.com/a/25962187/2254228; you can do:
chunksize = 1000
for chunk in pd.read_csv(data, chunksize=chunksize):
    # process(chunk) using the solution above
    # then append the chunk to a new csv with chunk.to_csv('new_data', mode='a')
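A filled-in version of that loop, as a sketch only ('data.csv' and 'matched.csv' are placeholder file names here, and writing the header from the first chunk is just one way to handle it; the full example below pre-creates the target file instead):

import pandas as pd
import numpy as np

chunksize = 1000
first_chunk = True
for chunk in pd.read_csv('data.csv', chunksize=chunksize):
    # same comparison as in the first example, applied chunk by chunk
    chunk['Match'] = np.where(chunk.Lookup == chunk.CUI, chunk.Out, 'No Match')
    # write the header once, then append subsequent chunks
    chunk.to_csv('matched.csv', mode='w' if first_chunk else 'a',
                 header=first_chunk, index=False)
    first_chunk = False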
Edit2:
Here is full sample code. Replace data.csv and new_data.csv with whatever your files are actually called, and replace the file paths with your own paths. Reading the file in chunks this way avoids memory errors when the data file is too big to load at once.
For a sample data.csv:
CUI ICD9/10 Lookup Out
C0161894 39 C0000005 4000001
C0029730 398 C0000039 4000002
C0176693 398 C0000052 4000003
C0029730 3989 C0000074 4000004
C0000074 3989 C0000074 4000005
Create a target file new_data.csv containing only a header row to receive the new data frame (note that it needs a Match column for the new output):
CUI ICD9/10 Lookup Out Match
Then read the old data in chunks, where chunksize is the number of lines of the file to read in at a time:
import pandas as pd
import numpy as np

# Read in line by line: set chunksize = 1
chunksize = 1
# Open the new file for appending (the header row is already in it)
with open("Pathtonewfile/new_data.csv", "a") as f:
    # Iterate over the old data.csv file, reading in chunksize lines at a time
    for chunk in pd.read_csv('Pathtooldfile/data.csv', index_col=False, chunksize=chunksize):
        # Carry out the lookup calculation as above
        chunk['Match'] = np.where(chunk.Lookup == chunk.CUI, chunk.Out, 'No Match')
        # Append the chunk to "new_data.csv", without repeating the header
        chunk.to_csv(f, header=False, index=False,
                     columns=['CUI', 'ICD9/10', 'Lookup', 'Out', 'Match'])
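As a quick sanity check (simply reading the result back in, using the same placeholder path as above), you can load the new file and print it:

import pandas as pd

# Read the freshly written file back in; the header row comes from the
# new_data.csv created before the loop
print(pd.read_csv("Pathtonewfile/new_data.csv"))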
This prints the contents of new_data.csv as follows:
CUI ICD9/10 Lookup Out Match
0 C0161894 39 C0000005 4000001 No Match
1 C0029730 398 C0000039 4000002 No Match
2 C0176693 398 C0000052 4000003 No Match
3 C0029730 3989 C0000074 4000004 No Match
4 C0000074 3989 C0000074 4000005 4000005