I'm running into trouble when trying to remove duplicate topics in a List[Tuple[Union[bytes, str], Union[dict, dict]]].

Here's a sample of the list:

 analyzed_comments = [('setup.py', {'Topic_0': ['version', 'get'], 'Topic_1': ['version', 'get']}), 
    ('translation.py', {'Topic_0': ['multiline', 'pattern', 'skip'], 'Topic_1': ['multiline', 'concat', 'text']})]

I would like to have a resulting list that stores:

  • the name of the file
  • a list of non-redundant topics

Something like:

comment_topics = [('setup.py', ['version', 'get']), 
                    ('translation.py', ['multiline', 'pattern', 'skip', 'concat', 'text'])]

This is what I wrote, but it doesn't seem to do the job:

comment_topics = list()
temp_comments = list()
for file, comment in analyzed_comments:
    for topic in comment:
        elem = comment[topic]
        temp_comments = list(set(elem + temp_comments))

    tupla = (file, temp_comments)
    comment_topics.append(tupla)
print(comment_topics)

Have you got any ideas?

nic

1 Answer

My idea would be to iterate over the files and simply combine all topics. In the end we create a set from the topics to remove all duplicates:

bodies = [('setup.py', {'Topic_0': ['version', 'get'], 'Topic_1': ['version', 'get']}),
          ('translation.py', {'Topic_0': ['multiline', 'pattern', 'skip'], 'Topic_1': ['multiline', 'concat', 'text']})]

result = {}
for name, topics in bodies:
    # gather every topic list for this file into one flat list
    values = []
    for v in topics.values():
        values.extend(v)
    # a set removes the duplicates; convert back to a list
    result[name] = list(set(values))

result

Output:

{'setup.py': ['version', 'get'],
 'translation.py': ['multiline', 'pattern', 'text', 'skip', 'concat']}

You can also do it in one line:

{k: list(set(sum(v.values(),[]))) for k, v in bodies}

or with itertools, which should be faster than sum() in most cases:

import itertools
result = {k: list(set(itertools.chain.from_iterable(v.values()))) for k, v in bodies}
JANO
  • Thanks so much! I don't really understand the one-liner though. Which one of them is better/more efficient? I work on very large datasets. – nic Feb 06 '22 at 12:17
  • You're welcome. The one-liner uses `sum()` instead of the inner loop with `extend()`, which is not even faster, according to [this post](https://stackoverflow.com/a/45323085/7947994). The outer part of the one-liner is a generator, which has lazy evaluation (good for large datasets) but is not necessarily faster. I would start with the first option as it is more understandable and optimize from there if it is not fast enough. Your problem can also be parallelized, since each operation on a row is independent of all the others. – JANO Feb 06 '22 at 12:31
  • Had another look and added a third option. Using a generator is probably not a bad idea if your dataset is large. The performance also varies depending on how many topics each row has, and so on. I would simply test the solutions with your dataset and take the fastest one. – JANO Feb 06 '22 at 12:45