4

I'm comparing two lists of the same structure: one with the full dataset and one with a subset. list_a is the full list and list_b is the subset. The result should be a list_c containing the rows that are different or not in list_b.

for row_a in file_a:
    for row_b in file_b:
        if row_a != row_b:
            file_c.append(row_a)

The if statement seems to be wrong, as file_c contains the values of file_a and file_b multiple times.

Manuel

3 Answers

3

Python actually has a built-in set data structure. Maybe it would be beneficial to use it in your case?

file_a = ['a', 'b', 'c']
file_b = ['b']
set(file_a).difference(file_b)
Out[4]: {'a', 'c'}
list(set(file_a).difference(file_b))
Out[5]: ['a', 'c']
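One caveat: a set drops duplicates and ordering, and its elements must be hashable, so rows that are lists (for example, as produced by csv.reader) would have to be converted to tuples first. A minimal sketch, assuming each row is a list of hashable values; the helper name rows_not_in and the sample data are made up for illustration:

```python
def rows_not_in(full, subset):
    """Return the rows of `full` that do not occur in `subset`, in original order."""
    # Lists are unhashable, so store each subset row as a tuple in the lookup set.
    seen = {tuple(row) for row in subset}
    return [row for row in full if tuple(row) not in seen]


file_a = [["1", "apple"], ["2", "pear"], ["3", "plum"]]
file_b = [["2", "pear"]]
file_c = rows_not_in(file_a, file_b)
print(file_c)  # [['1', 'apple'], ['3', 'plum']]
```

This keeps the order of file_a and any duplicate rows in it, which a plain set difference would not.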
erhesto
2

For sure there are better ways to do the job, but the following should do:

file_c.extend(row for row in file_a if row not in file_b)
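To see why the nested loop in the question collects too much: row_a != row_b is true for every row of file_b that differs from row_a, so each row_a is appended once per non-matching row_b. A small comparison on made-up data (the names nested and filtered are only for illustration):

```python
file_a = ["a", "b", "c"]
file_b = ["b", "c"]

# Original approach: appends row_a once for each row_b that differs from it.
nested = []
for row_a in file_a:
    for row_b in file_b:
        if row_a != row_b:
            nested.append(row_a)

# Membership test: each row_a is considered exactly once.
filtered = [row for row in file_a if row not in file_b]

print(nested)    # ['a', 'a', 'b', 'c']
print(filtered)  # ['a']
```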
fernandezcuesta
0

You should be able to do something simple like this:

file_c = list(set(file_a) - set(file_b))

This should have fairly low overhead, since it only uses builtins. I suppose it may be equivalent to

list(set(file_a).difference(file_b)) 

from erhesto's answer. I'm not sure whether the builtin method is faster than the '-' operator overload on set.

Okay, after testing, this is what I've found. I set up two different files, sub.py and dif.py.

Outputs:

swift@pennywise practice $ time python sub.py
[27, 17, 19, 31]

real    0m0.055s
user    0m0.044s
sys     0m0.008s
swift@pennywise practice $ time python dif.py
[17, 19, 27, 31]

real    0m0.056s
user    0m0.032s
sys     0m0.016s

Body of the .py files:

sub.py:

#!/usr/bin/python3.6
# -*- coding: utf-8 -*-


def test():
    lsta = [2, 3, 5, 7, 9, 13, 17, 19, 27, 31]
    lstb = [2, 3, 5, 7, 9, 13]

    # Set subtraction via the - operator; both operands must be sets.
    lstc = list(set(lsta) - set(lstb))

    return lstc


if __name__ == '__main__':
    print(test())

dif.py:

#!/usr/bin/python3.6
# -*- coding: utf-8 -*-


def test():
    lsta = [2, 3, 5, 7, 9, 13, 17, 19, 27, 31]
    lstb = [2, 3, 5, 7, 9, 13]

    # difference() accepts any iterable, so lstb is passed as a plain list.
    lstc = list(set(lsta).difference(lstb))

    return lstc


if __name__ == '__main__':
    print(test())

Edited because I realized an error - forgot to execute the programs!

In these runs the two are practically indistinguishable (0.055 s vs. 0.056 s real time), so there is no clear speed winner. I would probably stick with '-' over set.difference; it's easier for me to read what's going on.
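Since time python script.py mostly measures interpreter start-up for a script this small, a measurement with the standard-library timeit module is probably more telling. A rough sketch, reusing the lists from above; the function names and the repetition count are arbitrary choices, and the numbers will vary by machine:

```python
import timeit

lsta = [2, 3, 5, 7, 9, 13, 17, 19, 27, 31]
lstb = [2, 3, 5, 7, 9, 13]


def with_operator():
    # Both operands must be sets for the - operator.
    return list(set(lsta) - set(lstb))


def with_method():
    # difference() accepts any iterable, so lstb is not converted first.
    return list(set(lsta).difference(lstb))


# Run each variant many times so start-up cost does not dominate.
t_operator = timeit.timeit(with_operator, number=100_000)
t_method = timeit.timeit(with_method, number=100_000)

print(f"set - set:        {t_operator:.3f} s")
print(f"set.difference(): {t_method:.3f} s")
```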

Source for the set() - set() functionality: https://stackoverflow.com/a/3462160/9268051

Jamie Crosby