What is the most efficient way to compute the difference of lines from two files?

Question

I have two lists in Python, list_a and list_b, and both hold image links. About 99% of the items are the same, but I need to find the remaining 1%. All the surplus items are in list_a; that is, every item in list_b is also in list_a. My initial idea is to subtract the lists: list_a - list_b = list_c, where list_c holds my surplus items. My code is:

list_a = []
list_b = []
list_c = []

arq_b = open('list_b.txt','r')
for b in arq_b:
    list_b.append(b)

arq_a = open('list_a.txt','r')
for a in arq_a:
    if a not in arq_b:
        list_c.append(a)

arq_c = open('list_c.txt','w')
for c in list_c:
    arq_c.write(c)

I think the logic is right, and with only a few items the code runs fast. But I don't have 10 items, or 1,000, or even 100,000. I have 78,514,022 items in list_b.txt and 78,616,777 in list_a.txt. I don't know the cost of this expression: if a not in arq_b. But if I run this code, I don't think it will finish this year.

My PC has 8 GB of RAM, and I allocated 15 GB of swap so that I don't run out of memory.

My question is: is there another way to make this operation more efficient (faster)?

list_a is sorted, but list_b is not.
Each item looks like this: images/00000cd9fc6ae2fe9ec4bbdb2bf27318f2babc00.png
The order doesn't matter; I only want to know the surplus.

Does the order matter? If not, try using sets. With sets, subtraction should be linear: `set_c = set_a - set_b`. — L3viathan, Jan 10 '19 at 12:37
Will Python use the most efficient way to perform this operation? — Vinicius Morais, Jan 10 '19 at 12:40
Yes, I mean the Python datatype [`set`](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset). — L3viathan, Jan 10 '19 at 12:40
@tripleee It's not a duplicate of that - that question is about mapping subtraction over a list, this question is about the difference between what's included in the lists. — SpoonMeiser, Jan 10 '19 at 12:48
this isn't a duplicate of the question linked. Please don't roboclose... — Jean-François Fabre, Jan 10 '19 at 20:12

Jean-François Fabre · Accepted Answer · 2019-01-10T12:54:10.220

14

You can create a set from the contents of the first file, then use difference or symmetric_difference, depending on what you call a difference:

with open("list_a.txt") as f:
    set_a = set(f)

with open("list_b.txt") as f:
    diffs = set_a.difference(f)

If list_b.txt contains more items than list_a.txt, swap them or use set_a.symmetric_difference(f) instead, depending on what you need.

difference(f) works but still has to construct a new set internally. It is not a great performance gain (see "set issubset performance difference depending on the argument type"), but it's shorter.
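A minimal end-to-end sketch of this approach, assuming the file names from the question. The helper name `write_surplus` is hypothetical and not part of the original answer. It strips the line ending from each line, because a final line without a trailing newline would otherwise never compare equal to the same entry in the other file.

```python
def write_surplus(path_a, path_b, path_out):
    """Write the lines of path_a that do not occur in path_b to path_out."""
    with open(path_a) as f:
        # Strip line endings so a last line without "\n" still matches.
        set_a = {line.rstrip("\n") for line in f}

    with open(path_b) as f:
        # difference() accepts any iterable, so the second file is streamed.
        surplus = set_a.difference(line.rstrip("\n") for line in f)

    with open(path_out, "w") as f:
        # Sets are unordered; sorting only makes the output reproducible.
        for item in sorted(surplus):
            f.write(item + "\n")

    return len(surplus)
```

Called as `write_surplus("list_a.txt", "list_b.txt", "list_c.txt")`, it returns the number of surplus lines written.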

edited Jan 10 '19 at 12:54

answered Jan 10 '19 at 12:52

Jean-François Fabre


Nice, this avoids having to allocate space for the second set. – L3viathan Jan 10 '19 at 12:54
1

Well, not really, because internally a `set` is created, then thrown away. But it's thrown away _faster_ – Jean-François Fabre Jan 10 '19 at 12:54
But is the complexity the same as subtracting sets? – Vinicius Morais Jan 10 '19 at 13:00
@ViniciusMorais The time complexity is the same, the space complexity (apparently), too. – L3viathan Jan 10 '19 at 13:45
1

@L3viathan In case the original list (the original set) is not needed anymore, you can use `difference_update`. This should not require allocating a new set internally. – a_guest Jan 10 '19 at 14:33
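A short sketch of the `difference_update` variant suggested in the last comment, assuming the set built from the first file is not needed afterwards. The function name `surplus_in_place` is hypothetical. `set.difference_update` accepts any iterable, so the second file can be passed to it directly.

```python
def surplus_in_place(path_a, path_b):
    """Return the lines of path_a that do not occur in path_b."""
    with open(path_a) as f:
        remaining = set(f)

    with open(path_b) as f:
        # Removes matching lines from `remaining` itself instead of
        # returning a new result set.
        remaining.difference_update(f)

    return remaining
```

Lines are compared with their line endings here, so both files should end with a newline for the last entries to match.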

L3viathan · Answer 2 · 2019-01-10T12:51:54.073

11

Try using sets:

with open("list_a.txt") as f:
    set_a = set(f)

with open("list_b.txt") as f:
    set_b = set(f)

set_c = set_a - set_b

with open("list_c.txt","w") as f:
    for c in set_c:
        f.write(c)

The complexity of subtracting two sets is O(n) in the size of set_a.

edited Jan 10 '19 at 12:51

answered Jan 10 '19 at 12:43

L3viathan


2

You know - an open file is an iterator - therefore you can simply do `set_a = set(open("list_a.txt"))` – jsbueno Jan 10 '19 at 12:47
11

Yes, but doing `set(f)` in a `with` block ensures that the file gets closed – Jean-François Fabre Jan 10 '19 at 12:50

score 2 · Answer 3 · answered Jan 10 '19 at 12:44

To extend the comment of @L3viathan: if the order of the elements is not important, a set is the right way. Here is a dummy example you can adapt:

l1 = [0,1,2,3,4,5]
l2 = [3,4,5]
setL1 = set(l1)  # transform the list into a set
setL2 = set(l2)
setDiff = setL1 - setL2  # compute the difference
listeDiff = list(setDiff)  # if you want to have your elements back in a list

As you can see, it is pretty straightforward in Python.

a_guest · Answer 4 · 2019-01-10T13:04:28.337

2

In case order matters, you can presort the lists together with the item indices and then iterate over them together:

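list_2 = sorted(list_2)
diff_idx = []
j = 0
# Walk list_1 in sorted order while remembering each item's original index.
for i, x in sorted(enumerate(list_1), key=lambda x: x[1]):
    # Guard against running past the end of list_2 once all of it is matched.
    if j >= len(list_2) or x != list_2[j]:
        diff_idx.append(i)
    else:
        j += 1
# Restore the original order of list_1 for the surplus items.
diff = [list_1[i] for i in sorted(diff_idx)]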

This has the time complexity of the sorting algorithm, i.e. O(n*log n).
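A small usage sketch of the same idea wrapped in a function, assuming (as in the question) that every item of the second list also occurs in the first. The name `ordered_surplus` is hypothetical. Unlike the set-based answers, duplicates are matched one for one, so an item that occurs twice in the first list and once in the second is reported once.

```python
def ordered_surplus(list_1, list_2):
    """Return the items of list_1 left over after matching list_2, in list_1's order."""
    sorted_2 = sorted(list_2)
    diff_idx = []
    j = 0
    # Visit list_1 in sorted order, keeping each item's original index.
    for i, x in sorted(enumerate(list_1), key=lambda pair: pair[1]):
        if j < len(sorted_2) and x == sorted_2[j]:
            j += 1  # matched one item of list_2
        else:
            diff_idx.append(i)  # no counterpart in list_2
    return [list_1[i] for i in sorted(diff_idx)]


print(ordered_surplus(["d", "a", "c", "b"], ["b", "d"]))  # ['a', 'c']
```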

edited Jan 10 '19 at 13:04

answered Jan 10 '19 at 12:57

a_guest

