Code is running slowly - performance issue in Python

Question

I have a file with four comma-separated columns. I only need the first column, so I read the file, split each line on commas, and store the first field in a list called first_file_list.

I have a second file with six comma-separated columns. For each row of that file, I need to check whether its first column exists in first_file_list. If it does, I copy that line to a new file.

My first file has approximately 6 million records and the second has approximately 4.5 million. To check the performance of my code, I put only 100k records in the second file instead of 4.5 million, and processing those 100k records takes approximately 2.5 hours.

Following is my logic for this:

first_file_list = []

with open(r"c:\first_file.csv") as first_f:  # raw string: "\f" would otherwise be a form feed
    next(first_f)  # Ignoring first row as it is header and I don't need that
    temp = first_f.readlines()
    for x in temp:
        first_file_list.append(x.split(',')[0])
first_f.close()

with open(r"c:\second_file.csv") as second_f:
    next(second_f)
    second_file_co = second_f.readlines()
second_f.close()

out_file = open(r"c:\output_file.csv", "a")
for x in second_file_co:
    if x.split(',')[0] in first_file_list:
        out_file.write(x)
out_file.close()

Can you please help me understand what I am doing wrong that makes my code take this long to compare 100k records? Or can you suggest a better way to do this in Python?

Make `first_file_list` a `set` where each entry is a delimited string rather than a `list`. — Axe319, Sep 08 '21 at 13:26
By the way, using the `with` statement means you don't have to call `.close()` on open file objects. — Axe319, Sep 08 '21 at 13:28
You can iterate over the file directly (`for x in first_f`); there's no need to read the entire file into memory first. — chepner, Sep 08 '21 at 13:31
I'd suggest not using "Lac" or, more correctly, "lakh", since AFAIK it is not well known outside of India and certain regions. I like English and it is THE language; please learn it well (which you have not done yet, BTW), you'll be glad you did. :-) – Andrew, Sep 08 '21 at 13:45
I would also recommend not using the Indian way of grouping numbers in 2+2+3 digits. That will definitely cause people to misread them. – James Z, Sep 08 '21 at 15:29

Axe319 · Accepted Answer · 2021-09-08T13:42:35.610

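Use a set for fast membership checking: `in` on a list scans the elements one by one, while `in` on a set is a hash lookup that takes roughly constant time on average. Also, there's no need to copy the contents of the entire file into memory. You can just iterate over the remaining lines of the file.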

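first_entries = set()
# raw strings keep "\f" from being read as a form-feed escape
with open(r"c:\first_file.csv") as first_f:
    next(first_f)  # skip the header row
    for line in first_f:
        # keep only the first column of each row
        first_entries.add(line.split(',')[0])

with open(r"c:\second_file.csv") as second_f:
    with open(r"c:\output_file.csv", "a") as out_file:
        next(second_f)  # skip the header row
        for line in second_f:
            # set lookup is a hash probe, not a scan of every entry
            if line.split(',')[0] in first_entries:
                out_file.write(line)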
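
To see why the switch from a list to a set matters, here is a minimal timing sketch. The data is a hypothetical stand-in (100,000 generated keys, not the asker's files), and the exact timings will vary by machine; only the relative gap is the point.

```python
import timeit

# Hypothetical stand-in data: 100,000 string keys (far smaller than the real files).
keys = [str(i) for i in range(100_000)]
keys_list = list(keys)
keys_set = set(keys)

# Look up a key at the end of the list, the worst case for a linear scan.
needle = "99999"

# Time 200 membership tests against each container.
list_time = timeit.timeit(lambda: needle in keys_list, number=200)
set_time = timeit.timeit(lambda: needle in keys_set, number=200)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

With 100k lookups against a list of about 6 million entries, each lookup may have to scan millions of elements, which is consistent with the hours-long runtime described in the question.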

Additionally, I noticed you called .close() on file objects that were opened with the with statement. Using with (a context manager) means all the cleanup is done when you exit its block, so it handles the .close() for you.
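
A small sketch of that behavior, assuming a hypothetical scratch file created only for the demonstration (the path and contents are not from the question):

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

with open(path, "w") as f:
    f.write("id,value\n1,a\n")
    inside = f.closed   # still open inside the with block

after = f.closed        # closed automatically once the block exits
print(inside, after)
```

The file object's `closed` attribute flips once the block is left, so an explicit `.close()` afterwards is redundant (though harmless).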

It now takes just 20 seconds to process the file with 45 lakh records. 45 lakh = 45,00,000 (4.5 million) – Vinkesh Shah, Sep 08 '21 at 14:52

score 3 · Answer 2 · answered Sep 08 '21 at 13:33

Work with sets, see below:

first_file_values = set()
second_file_values = set()

# raw strings keep "\f" from being read as a form-feed escape
with open(r"c:\first_file.csv") as first_f:
    next(first_f)  # skip the header row
    for x in first_f:
        first_file_values.add(x.split(',')[0])

with open(r"c:\second_file.csv") as second_f:
    next(second_f)  # skip the header row
    for x in second_f:
        second_file_values.add(x.split(',')[0])

# Note: this writes only the matching first-column values
# (deduplicated, in arbitrary order), not the full lines.
with open(r"c:\output_file.csv", "a") as out_file:
    for x in second_file_values:
        if x in first_file_values:
            out_file.write(x + "\n")  # one value per line
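
The final loop above amounts to a set intersection of the first-column values. As a hedged sketch, the same idea can be written with the `&` operator; the helper name `common_keys` and the sample rows are hypothetical, made up for illustration:

```python
def common_keys(first_lines, second_lines):
    """Return first-column values that appear in both iterables of CSV lines."""
    first = {line.split(",")[0] for line in first_lines}
    second = {line.split(",")[0] for line in second_lines}
    return first & second  # set intersection


# Hypothetical sample rows standing in for the two files (headers already skipped).
first_sample = ["a,1,2,3\n", "b,4,5,6\n", "c,7,8,9\n"]
second_sample = ["b,x,y,z,p,q\n", "d,x,y,z,p,q\n", "b,r,s,t,u,v\n"]

result = common_keys(first_sample, second_sample)
print(sorted(result))
```

Note that an intersection yields only the distinct keys. If the full matching lines from the second file are needed, as the question asks, each line still has to be tested individually against the first set.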

balderman


Good news! Can you explain what `Lakh` is? – balderman Sep 08 '21 at 14:50
45 lakh = 45,00,000 (4.5 million) – Vinkesh Shah Sep 08 '21 at 14:53
