I have file which has 4 columns with, separated values. I need only first column only so I have read file then split that line with, separated and store it in one list variable called first_file_list.
I have another file which has 6 columns with, separated values. My requirement is read first column of first row of file and check that string is exist in list called first_file_list. If that is exist then copy that line to new file.
My first file has approx. 6 million records and second file has approx. 4.5 million records. Just to check the performance of my code instead of 4.5 million I have put only 100k records in second file and to process the 100k record code takes approx. 2.5 hours.
Following is my logic for this:
first_file_list = []
with open("c:\first_file.csv") as first_f:
next(first_f) # Ignoring first row as it is header and I don't need that
temp = first_f.readlines()
for x in temp:
first_file_list.append(x.split(',')[0])
first_f.close()
with open("c:\second_file.csv") as second_f:
next(second_f)
second_file_co = second_f.readlines()
second_f.close()
out_file = open("c:\output_file.csv", "a")
for x in second_file_co:
if x.split(',')[0] in first_file_list:
out_file.write(x)
out_file.close()
Can you please help me to get to know that what I am doing wrong here so that my code take this much time to compare 100k records? or can you suggest better way to do this in Python.