1

I have a small script which compares, from CSV input files, how many items of the first list are in the second list. However, it takes a long time to run when there are many references.

data_1 = import_csv("test1.csv")

print(len(data_1))

data_2 = import_csv("test2.csv")

print(len(data_2))

data_to_keep = len([i for i in data_1 if i in data_2])

I just ran a test with 598756 items in the first list and 76612 in the second, and the script hasn't finished yet.

As I'm still relatively new to Python, I would like to know if there is a faster way to achieve what I'm trying to do. Thank you for your help :)

EDIT: import_csv looks like this:

import csv

def import_csv(csvfilename):
    data = []
    with open(csvfilename, "r", encoding="utf-8", errors="ignore") as scraped:
        reader = csv.reader(scraped, delimiter=',')
        for row in reader:
            if row:  # avoid blank lines
                data.append(row[0])  # keep only the first column

    return data

5 Answers

4

Make data_2 a set.

data_2 = set(import_csv("test2.csv"))

In Python, sets are much faster than lists for checking whether an object is present (using the in operator): a set lookup takes roughly constant time on average, while a list is scanned element by element.

You might also see some improvement from switching the order of your inputs: make the larger file the set, so that you do fewer lookups when you iterate over the elements of the smaller file.
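A minimal sketch of this suggestion, using small in-memory lists in place of the CSV data. The helper name count_matches and the sample values are made up for illustration and are not part of the original script:

```python
def count_matches(items, candidates):
    """Count how many entries of `items` also appear in `candidates`."""
    lookup = set(candidates)  # built once; membership tests are O(1) on average
    return sum(1 for item in items if item in lookup)


# Hypothetical sample data standing in for the first column of each CSV file
data_1 = ["a", "b", "b", "c", "d"]
data_2 = ["b", "c", "e"]

# Every occurrence in data_1 is counted, so the repeated "b" counts twice
print(count_matches(data_1, data_2))
```

Note that swapping which input becomes the set changes what is counted if either file contains duplicates: iterating over the first list counts its matching entries, while iterating over the second counts the second list's matching entries.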

Bill the Lizard
2

You can use a set and its intersection, if duplicates can be safely discarded:

data1 = [1,2,3,3,4]
data2 = [2,3,5,6,1,6]

print(len(set(data1).intersection(data2)))
# 3

This is a set operation, and for large inputs it should be much faster than testing membership in a list.

Austin
  • I tried this as well but got only 74 matches, whereas I got around 20,000 with Bill the Lizard's answer. Does the difference come from duplicates? –  Dec 27 '18 at 16:13
  • @Araxide, presumably, yes. They might be duplicates. – Austin Dec 27 '18 at 16:15
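To illustrate how duplicates can make the two approaches disagree, here is a small sketch with made-up data (the values and variable names are assumptions, not taken from the asker's files):

```python
# Hypothetical data: "x" appears three times in the first list
data_1 = ["x", "x", "x", "y", "z"]
data_2 = ["x", "y", "w"]

lookup = set(data_2)

# List comprehension: counts every occurrence in data_1, duplicates included
with_duplicates = len([i for i in data_1 if i in lookup])

# Set intersection: counts each distinct value only once
unique_only = len(set(data_1).intersection(data_2))

print(with_duplicates, unique_only)
```

The first count is larger whenever a matching value is repeated in the first list, which would be consistent with the gap reported in the comment above.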
0

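Try this:

import csv

# Read the first column of each file while the file is still open;
# a csv.reader cannot be used after its file has been closed.
with open('test1.csv', newline='') as csvfile:
    list1 = [row[0] for row in csv.reader(csvfile, delimiter=',') if row]

with open('test2.csv', newline='') as csvfile2:
    list2 = [row[0] for row in csv.reader(csvfile2, delimiter=',') if row]

data_to_keep = len([i for i in list1 if i in list2])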
0

I'm making a few assumptions here but here's an idea...
test1.csv and test2.csv hold something unique, like serial numbers. Like...

9210268126,4628032171,6691918168,1499888554,2024134986, 8826205840,5643225730,3174290295,1881330725,7192644763, 7210351670,7956881819,4897219228,4638431591,6444695480, 1949859915,8919131597,2176933146,3875411064,3546520925

Try...

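with open("test1.csv") as f1, open("test2.csv") as f2:
    # Flatten every comma-separated value into one list of strings;
    # a list of lists could not be passed to set() because lists are unhashable.
    data_1 = [item.strip() for line in f1 for item in line.split(",") if item.strip()]
    data_2 = [item.strip() for line in f2 for item in line.split(",") if item.strip()]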

Since they're unique we can use the set functions to see which entries are in the other file:

data_to_keep = set(data_1).intersection(set(data_2))

I'm not sure how to do it faster - at that point it might be a hardware bottleneck.

KuboMD
0

This should also work. It converts the second list to a dictionary, which avoids the sequential search that the in operator performs on a list. With large datasets, you usually want to avoid using the in operator on a list.

data_1 = import_csv("test1.csv")
data_2 = {i: i for i in import_csv("test2.csv")}
data_to_keep = len([i for i in data_1 if data_2.get(i) is not None])
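The cost of a membership test depends on the container it is applied to, and a rough way to see this is to time the same lookup against a list, a set, and a dict. The sketch below uses made-up data and the standard timeit module; the absolute timings will vary by machine, so no expected output is shown:

```python
import timeit

# Hypothetical data: 2,000 string keys, probing for a value near the end
values = [str(n) for n in range(2000)]
as_list = values
as_set = set(values)
as_dict = {v: v for v in values}
probe = "1999"

for name, container in [("list", as_list), ("set", as_set), ("dict", as_dict)]:
    # The lambda is called inside this iteration, so it sees the right container
    seconds = timeit.timeit(lambda: probe in container, number=1000)
    print(f"{name}: {seconds:.6f} s for 1000 lookups")
```

On typical machines the list lookup should be noticeably slower than the other two, since it scans the elements one by one while sets and dicts use hashing.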
Marios Simou