Python Script takes much time / List comprehensions

Question

I have a small script that compares, from CSV input files, how many items of the first list are in the second list. However, it takes a long time to run when there are many references.

data_1 = import_csv("test1.csv")

print(len(data_1))

data_2 = import_csv("test2.csv")

print(len(data_2))

data_to_keep = len([i for i in data_1 if i in data_2])

I just ran a test with 598756 items in the first list and 76612 in the second, and the script hasn't finished yet.

As I'm still relatively new to Python, I would like to know if there is a faster way to achieve what I'm trying to do. Thank you for your help :)

EDIT: import_csv looks like this:

import csv

def import_csv(csvfilename):
    data = []
    with open(csvfilename, "r", encoding="utf-8", errors="ignore") as scraped:
        reader = csv.reader(scraped, delimiter=',')
        for row in reader:
            if row:  # avoid blank lines
                data.append(row[0])

    return data

Yes, can you create a small set of data for testing? Are you open to using additional libraries (numpy, pandas, etc.)? – Scott Boston, Dec 27 '18 at 15:56
I don't know numpy or pandas, but I can try them if that works better :) – Dec 27 '18 at 15:57
Membership tests using lists, i.e. `if i in data_2`, are linear time. Thus, your algorithm is quadratic. Use a `set` for `data_2`. – juanpa.arrivillaga, Dec 27 '18 at 16:03

score 4 · Accepted Answer · answered Dec 27 '18 at 16:01

Make data_2 a set.

data_2 = set(import_csv("test2.csv"))

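In Python, sets are much faster than lists for checking whether an object is present (using the in operator): a set lookup is a hash lookup that takes constant time on average, while a list lookup scans the elements one by one.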
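With the sizes from the question, the list version can need up to about 598756 × 76612 ≈ 4.6 × 10^10 comparisons in the worst case. A minimal timing sketch of the difference follows. The synthetic string data and the small sizes are assumptions chosen to keep the run short; they are not the asker's files.

```python
import timeit

# Synthetic stand-ins for the two CSV columns (assumed data, not the real files)
data_1 = [str(n) for n in range(0, 4000, 2)]  # 2000 even numbers as strings
data_2 = [str(n) for n in range(0, 3000)]     # 3000 consecutive numbers as strings

def count_with_list():
    # Each "in" scans data_2 element by element: O(len(data_2)) per lookup
    return len([i for i in data_1 if i in data_2])

data_2_set = set(data_2)

def count_with_set():
    # Each "in" is a hash lookup: O(1) on average
    return len([i for i in data_1 if i in data_2_set])

# Both functions return the same count; only the running time differs
print(count_with_list(), count_with_set())
print("list:", timeit.timeit(count_with_list, number=1))
print("set: ", timeit.timeit(count_with_set, number=1))
```

The exact timings depend on the machine, but the gap widens quickly as the second list grows.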

You might also see some improvement from switching the order of your inputs. Make the larger file the set; that way you do fewer lookups when you iterate over the elements of the smaller file.
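One caveat when swapping the inputs: the two directions only give the same count if neither file contains duplicates. A small sketch with made-up values (not taken from the question) shows the difference:

```python
# Made-up sample values containing duplicates
data_1 = ["a", "a", "b", "c"]
data_2 = ["a", "b", "b", "b", "d"]

# Original direction: how many items of data_1 appear in data_2
set_2 = set(data_2)
count_1_in_2 = sum(1 for i in data_1 if i in set_2)

# Swapped direction: how many items of data_2 appear in data_1
set_1 = set(data_1)
count_2_in_1 = sum(1 for i in data_2 if i in set_1)

print(count_1_in_2)  # 3  ("a", "a", "b")
print(count_2_in_1)  # 4  ("a", "b", "b", "b")
```

If both files hold unique values, either direction returns the same number, and the choice is purely about speed.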

answered Dec 27 '18 at 16:01

Bill the Lizard

Really much faster ! Thanks a lot – Dec 27 '18 at 16:11

score 2 · Answer 2 · answered Dec 27 '18 at 16:02

You can use set and its intersection, if duplicates can be safely discarded:

data1 = [1,2,3,3,4]
data2 = [2,3,5,6,1,6]

print(len(set(data1).intersection(data2)))
# 3

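This is a set operation. Building the set and intersecting take roughly linear time on average, so it should be much faster than the list-based membership test in the question.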

answered Dec 27 '18 at 16:02

Austin

I tried this also but had only 74 matches, while I got around 20 000 with Bill the Lizard's answer. Does it come from duplicates? – Dec 27 '18 at 16:13
@Araxide, presumably, yes. They might be duplicates. – Austin Dec 27 '18 at 16:15
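The gap between the two counts can be reproduced with the example lists from this answer. The list comprehension counts every occurrence in the first list, while the intersection counts each distinct value once. A sketch, using collections.Counter to show which values repeat (the variable names are illustrative):

```python
from collections import Counter

data1 = [1, 2, 3, 3, 4]
data2 = [2, 3, 5, 6, 1, 6]

lookup = set(data2)

# Counts every occurrence in data1, duplicates included
with_duplicates = len([i for i in data1 if i in lookup])

# Counts each distinct common value once
distinct = len(set(data1).intersection(data2))

# Number of occurrences in data1 of each matching value
repeats = Counter(i for i in data1 if i in lookup)

print(with_duplicates)  # 4
print(distinct)         # 3
print(repeats)          # Counter({3: 2, 1: 1, 2: 1})
```

Which count is the right one depends on whether repeated rows in the first file should each be counted.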

score 0 · Answer 3 · answered Dec 27 '18 at 16:08

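Try this. The rows must be read while the files are still open, and the second file is stored in a set so that lookups are fast:

import csv

with open('test1.csv', newline='') as csvfile:
    # Read the first column while the file is still open; skip blank rows
    list1 = [row[0] for row in csv.reader(csvfile, delimiter=',') if row]

with open('test2.csv', newline='') as csvfile2:
    # A set gives constant-time membership tests on average
    set2 = {row[0] for row in csv.reader(csvfile2, delimiter=',') if row}

data_to_keep = len([i for i in list1 if i in set2])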

score 0 · Answer 4 · answered Dec 27 '18 at 16:14

I'm making a few assumptions here but here's an idea...
test1.csv and test2.csv hold something unique, like serial numbers. Like...

9210268126,4628032171,6691918168,1499888554,2024134986, 8826205840,5643225730,3174290295,1881330725,7192644763, 7210351670,7956881819,4897219228,4638431591,6444695480, 1949859915,8919131597,2176933146,3875411064,3546520925

Try...

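with open("test1.csv") as f1, open("test2.csv") as f2:
    # Flatten each file into one list of stripped values; skip empty fields.
    # (A list of lists would not work: lists are unhashable and cannot go in a set.)
    data_1 = [v.strip() for line in f1 for v in line.split(",") if v.strip()]
    data_2 = [v.strip() for line in f2 for v in line.split(",") if v.strip()]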

Since they're unique we can use the set functions to see which entries are in the other file:

data_to_keep = set(data_1).intersection(set(data_2))

I'm not sure how to do it faster - at that point it might be a hardware bottleneck.

score 0 · Answer 5 · answered Dec 27 '18 at 16:34

This should also work. It converts the second list to a dictionary, which avoids the sequential search that the in operator performs on a list. On large datasets, avoid using the in operator on a list; on a dict or a set it is a fast hash lookup.

data_1 = import_csv("test1.csv")
data_2 = {i: i for i in import_csv("test2.csv")}
data_to_keep = len([i for i in data_1 if data_2.get(i) is not None])
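Since only the keys are used here, a set would serve equally well as the dictionary, and the first file does not need to be held in memory at all. A sketch under those assumptions follows. The helper names iter_first_column and count_matches are hypothetical, and the tiny temporary files stand in for test1.csv and test2.csv:

```python
import csv
import os
import tempfile

def iter_first_column(csvfilename):
    # Yield the first column of each non-blank row, one value at a time
    with open(csvfilename, "r", encoding="utf-8", errors="ignore", newline="") as f:
        for row in csv.reader(f, delimiter=","):
            if row:
                yield row[0]

def count_matches(file_1, file_2):
    # Hash-based lookup table built once from the second file
    lookup = set(iter_first_column(file_2))
    # Stream the first file; duplicates in it are counted each time
    return sum(1 for value in iter_first_column(file_1) if value in lookup)

# Tiny made-up files so the sketch runs on its own
with tempfile.TemporaryDirectory() as tmp:
    path_1 = os.path.join(tmp, "test1.csv")
    path_2 = os.path.join(tmp, "test2.csv")
    with open(path_1, "w", encoding="utf-8", newline="") as f:
        f.write("a,1\nb,2\n\nc,3\na,4\n")
    with open(path_2, "w", encoding="utf-8", newline="") as f:
        f.write("a,x\nc,y\nz,w\n")
    result = count_matches(path_1, path_2)

print(result)  # 3  ("a", "c", "a")
```

With real data, count_matches("test1.csv", "test2.csv") would replace the temporary-file part.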
