Python: something faster than not in for large lists?

Question

I'm doing a project with word lists. I want to combine two word lists, but only store the unique words.

I'm reading the words from a file and it seems to take a long time to read the file and store it as a list. I intend to copy the same block of code and run it using the second (or any subsequent) word files. The slow part of the code looks like this:

    while inLine != "":
        inLine = inLine.strip()
        if inLine not in inList:
            inList.append(inLine)
        inLine = inFile.readline()

Please correct me if I'm wrong, but I think the slow(est) part of the program is the "not in" comparison. What are ways I can rewrite this to make it faster?

Possible duplicate of [Removing duplicates in lists](http://stackoverflow.com/questions/7961363/removing-duplicates-in-lists); the answers there cover options both with and without preserving the original list order – Chris_Rands, Mar 28 '17 at 15:56

score 6 · Answer 1 · answered Mar 28 '17 at 15:53

Judging by this line:

    if inLine not in inList:
        inList.append(inLine)

It looks like you are enforcing uniqueness in the inList container. Consider using a more efficient data structure instead, such as a set (call it inSet). The not in check can then be dropped as redundant, because the container itself prevents duplicates.
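
A minimal sketch of the set-based approach, assuming one word per line. The function name `read_unique_words` and the demo file names are hypothetical, not from the question, and unlike the original loop this version skips blank lines:

```python
import os
import tempfile


def read_unique_words(path, words=None):
    """Add every non-empty, stripped line of the file at `path` to a set."""
    if words is None:
        words = set()
    with open(path) as in_file:
        for line in in_file:      # iterate over the file one line at a time
            word = line.strip()
            if word:              # skip blank lines (assumption, see above)
                words.add(word)   # a set silently ignores duplicates
    return words


# Hypothetical demo data standing in for the real word files.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.txt")
    second = os.path.join(tmp, "second.txt")
    with open(first, "w") as f:
        f.write("apple\nbanana\napple\n")
    with open(second, "w") as f:
        f.write("banana\ncherry\n")

    # Reuse the same set for the second (or any subsequent) file.
    unique_words = read_unique_words(first)
    unique_words = read_unique_words(second, unique_words)

print(sorted(unique_words))  # ['apple', 'banana', 'cherry']
```

Passing the existing set back in for each further file avoids copying the block of code per file.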

If insertion order must be preserved, you can achieve a similar result by using an OrderedDict whose keys are the words and whose values are all None.
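
A rough sketch of the order-preserving variant. The helper name `unique_in_order` and the sample lines are made up for illustration:

```python
from collections import OrderedDict


def unique_in_order(lines):
    """Return the distinct stripped lines, in order of first appearance."""
    seen = OrderedDict()
    for line in lines:
        word = line.strip()
        if word:                # skip blank lines (an assumption)
            seen[word] = None   # the value is a placeholder; only keys matter
    return list(seen)


# Hypothetical sample input; an open file object would work the same way.
sample_lines = ["pear\n", "apple\n", "pear\n", "fig\n", "apple\n"]
ordered_words = unique_in_order(sample_lines)
print(ordered_words)  # ['pear', 'apple', 'fig']
```

On Python 3.7 and later a plain dict also keeps insertion order, so `list(dict.fromkeys(words))` should give the same result for an iterable of already-stripped words.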

wim

Maybe you should add the main benefit of using sets in this case: Going from O(n) to O(1) average case complexity. – L3viathan Mar 28 '17 at 16:11
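
To make the complexity difference concrete, here is a small timing sketch. The size `n` and the probe string are arbitrary choices for illustration, and the absolute timings will vary by machine:

```python
import timeit

n = 100_000
words_list = [str(i) for i in range(n)]
words_set = set(words_list)
missing = "not-present"  # worst case for the list: every element is checked

# Membership test on a list scans elements one by one: O(n).
list_time = timeit.timeit(lambda: missing in words_list, number=100)
# Membership test on a set is a hash lookup: O(1) on average.
set_time = timeit.timeit(lambda: missing in words_set, number=100)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

Because the original loop runs one such list scan per input line, its total cost grows roughly quadratically with the number of words.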

score 0 · Answer 2 · answered Mar 28 '17 at 15:55

If you want to combine two lists and remove the duplicates, you could try something like this:

    combined_list = list(set(first_list) | set(second_list))
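
For example, with two small made-up lists (the contents are illustrative only):

```python
first_list = ["apple", "banana", "apple"]
second_list = ["banana", "cherry"]

# Union of the two sets keeps each word once.
combined_list = list(set(first_list) | set(second_list))

# Set iteration order is arbitrary, so sort if a stable order is needed.
combined_sorted = sorted(combined_list)
print(combined_sorted)  # ['apple', 'banana', 'cherry']
```

Note that this does not preserve the original order of either list.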

jape
