If I understand you correctly, you want to compare the values in column 4 and column 1 and, if they are equal, output a new column containing the value from column 3.
To do this you can use np.where as follows:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'CUI': ['C0161894', 'C0029730', 'C0176693', 'C0029730', 'C0000074'],
                    'ICD9/10': [39, 398, 398, 3989, 3989],
                    'Out': [4000001, 4000002, 4000003, 4000004, 4000005],
                    'Lookup': ['C0000005', 'C0000039', 'C0000052', 'C0000074', 'C0000074']})

# Where Lookup matches CUI, take the value from Out; otherwise flag 'No Match'
df1['Match'] = np.where(df1.Lookup == df1.CUI, df1.Out, 'No Match')
Output:
CUI ICD9/10 Lookup Out Match
0 C0161894 39 C0000005 4000001 No Match
1 C0029730 398 C0000039 4000002 No Match
2 C0176693 398 C0000052 4000003 No Match
3 C0029730 3989 C0000074 4000004 No Match
4 C0000074 3989 C0000074 4000005 4000005
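As a side note (not something the question requires): the 'No Match' string forces the Match column to a mixed object dtype. If you would rather keep that column numeric, one option is to leave non-matching rows as NaN instead, for example with Series.where:

# Continuing from the df1 defined above: NaN in place of 'No Match'
# keeps the Match column numeric (float, because NaN is a float)
df1['Match'] = df1['Out'].where(df1['Lookup'] == df1['CUI'])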
Edit:
In response to your comment, you can use the chunksize parameter of pandas.read_csv to read the file in chunks instead of loading the whole dataframe into memory at once.
For data in a csv file as follows:
CUI ICD9/10 Lookup Out
C0161894 39 C0000005 4000001
C0029730 398 C0000039 4000002
C0176693 398 C0000052 4000003
C0029730 3989 C0000074 4000004
C0000074 3989 C0000074 4000005
See https://stackoverflow.com/a/25962187/2254228; you can do:
chunksize = 1000
for chunk in pd.read_csv(data, chunksize=chunksize):
    # process(chunk) using the solution above
    # then append the chunk to a new csv with chunk.to_csv('new_data', mode='a')
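A filled-in version of that loop, as a sketch only ('data.csv' and 'matched.csv' are placeholder file names here, and writing the header from the first chunk is just one way to handle it; the full example below pre-creates the target file instead):

import pandas as pd
import numpy as np

chunksize = 1000
first_chunk = True
for chunk in pd.read_csv('data.csv', chunksize=chunksize):
    # same comparison as in the first example, applied chunk by chunk
    chunk['Match'] = np.where(chunk.Lookup == chunk.CUI, chunk.Out, 'No Match')
    # write the header once, then append subsequent chunks
    chunk.to_csv('matched.csv', mode='w' if first_chunk else 'a',
                 header=first_chunk, index=False)
    first_chunk = False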
Edit2:
Here is full sample code. Replace data.csv and new_data.csv with whatever your files are actually called, and replace the file paths with your own paths. Reading the file in chunks this way avoids memory errors when the data file is too big to load at once.
For a sample data.csv:
CUI ICD9/10 Lookup Out
C0161894 39 C0000005 4000001
C0029730 398 C0000039 4000002
C0176693 398 C0000052 4000003
C0029730 3989 C0000074 4000004
C0000074 3989 C0000074 4000005
Create a target file new_data.csv containing only a header row to receive the new data frame (note that it needs a Match column for the new output):
CUI ICD9/10 Lookup Out Match
Then read the old data in chunks, where chunksize is the number of lines of the file to read in at a time:
import pandas as pd
import numpy as np

# Read in line by line: set chunksize = 1
chunksize = 1
# Open the new file for appending (the header row is already in it)
with open("Pathtonewfile/new_data.csv", "a") as f:
    # Iterate over the old data.csv file, reading in chunksize lines at a time
    for chunk in pd.read_csv('Pathtooldfile/data.csv', index_col=False, chunksize=chunksize):
        # Carry out the lookup calculation as above
        chunk['Match'] = np.where(chunk.Lookup == chunk.CUI, chunk.Out, 'No Match')
        # Append the chunk to "new_data.csv", without repeating the header
        chunk.to_csv(f, header=False, index=False,
                     columns=['CUI', 'ICD9/10', 'Lookup', 'Out', 'Match'])
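As a quick sanity check (simply reading the result back in, using the same placeholder path as above), you can load the new file and print it:

import pandas as pd

# Read the freshly written file back in; the header row comes from the
# new_data.csv created before the loop
print(pd.read_csv("Pathtonewfile/new_data.csv"))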
This prints the contents of new_data.csv as follows:
CUI ICD9/10 Lookup Out Match
0 C0161894 39 C0000005 4000001 No Match
1 C0029730 398 C0000039 4000002 No Match
2 C0176693 398 C0000052 4000003 No Match
3 C0029730 3989 C0000074 4000004 No Match
4 C0000074 3989 C0000074 4000005 4000005