
This is the code I have. As you can see, I append each element to the list only if it is not already in the list, but I noticed that I still somehow get duplicate elements.

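import re

def getExtraData(table):
    extraData = list()
    for ele in table:
        # Collect the text inside every [...] group in the first column
        extras = re.findall(r'\[(.+?)\]', str(ele[0]))
        for extra in extras:
            single = extra.split(", ")
            for s in single:
                # Exact (case-sensitive) membership test
                if s not in extraData:
                    extraData.append(s)
    return extraData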

I took a screenshot in the PyCharm debugger console to show that the elements really are the same.

[screenshot of the PyCharm debugger console]

Why could this happen and how can I fix it?

Yu Hao
Jani

2 Answers


Why could this happen and how can I fix it?

There is nothing to fix; the code works as written. You get both "Box Set" and "Box set" because these are different strings. If you want the check to be case-insensitive, store lowercase versions and test against the lowercase form too, like this:

if s.lower() not in extraData:
    extraData.append(s.lower())
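A quick check makes the difference visible. This is only a minimal sketch: the two sample strings come from the question, and the `seen` list is a hypothetical stand-in for `extraData`.

```python
a = "Box Set"
b = "Box set"

# The strings differ in the case of one letter, so they compare unequal.
print(a == b)                  # False
print(a.lower() == b.lower())  # True

# List membership uses ==, so it is case-sensitive as well.
seen = ["Box Set"]
print(b in seen)                               # False
print(b.lower() in [s.lower() for s in seen])  # True
```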

Furthermore, why use a list at all? This should just be a set, which reduces the cost of the in test from O(N) to O(1) on average:

import re

def getExtraData(table):
    extraData = set()
    for ele in table:
        extras = re.findall(r'\[(.+?)\]', str(ele[0]))
        for extra in extras:
            single = extra.split(", ")
            for s in single:
                # A set ignores values it already contains
                extraData.add(s.lower())
    return list(extraData)

Or even a bit shorter (and slightly faster, since the innermost loop no longer runs as Python-level code):

import re

def getExtraData(table):
    extraData = set()
    for ele in table:
        extras = re.findall(r'\[(.+?)\]', str(ele[0]))
        for extra in extras:
            # update() adds every lowercased piece in one call
            extraData.update(map(str.lower, extra.split(", ")))
    return list(extraData)
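One thing to keep in mind with the set versions: list(extraData) comes back in arbitrary order. If first-seen order matters, a dict can stand in for an ordered set. The sketch below assumes Python 3.7+ (where dicts keep insertion order); the function name `get_extra_data_ordered` and the sample `table` are made up for illustration and are not from the question.

```python
import re

def get_extra_data_ordered(table):
    """Hypothetical variant: lowercase, de-duplicated, in first-seen order."""
    seen = {}  # dict keys behave like an insertion-ordered set
    for ele in table:
        for extra in re.findall(r'\[(.+?)\]', str(ele[0])):
            for s in extra.split(", "):
                seen.setdefault(s.lower(), None)
    return list(seen)

# Assumed input shape: rows whose first column holds "[tag, tag]" text.
table = [("Title [Box Set, DVD]",), ("Other [Box set]",), ("Third [Blu-ray, DVD]",)]
print(get_extra_data_ordered(table))  # ['box set', 'dvd', 'blu-ray']
```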
lejlot

As an alternative to lejlot's answer, if you want to preserve the original case of the strings stored in extraData, you could lowercase on the fly in the membership check:

if s.lower() not in map(str.lower, extraData):
    extraData.append(s)

Inspired by Case insensitive 'in' - Python.

Since extraData is a list in your case, the membership test is already linear, so this solution shouldn't add any significant performance penalty.
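If the repeated lowercasing of the whole list ever becomes a concern, one possible variation is to keep a separate set of lowercase keys next to the list, so each check is a single set lookup while the list still stores the original capitalization. This is a sketch, not part of the original answer; `add_case_insensitive` and `seen_lower` are hypothetical names.

```python
def add_case_insensitive(items, seen_lower, s):
    """Append s to items unless a case-insensitive match was added before."""
    key = s.lower()
    if key not in seen_lower:
        seen_lower.add(key)
        items.append(s)  # keeps the first capitalization encountered

extraData = []
seen_lower = set()
for s in ["Box Set", "DVD", "Box set", "dvd"]:
    add_case_insensitive(extraData, seen_lower, s)

print(extraData)  # ['Box Set', 'DVD']
```

As with the one-line check, the capitalization that ends up stored is whichever one appears first in the input.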

arekolek
  • This will be about twice as slow (two iterations over the list, plus extra memory allocation, because each element passing through the generator produces a new lowercase string, which, being immutable, has to be created as a copy), but of course it has the same big-O complexity. Also, you now store whichever capitalization appears first, so the result depends on the input order. – lejlot May 29 '16 at 13:13