1

I have a lists of tuples (>150k elements) containing an ID with the length of that ID. However, this list should only show IDs that appear in the second list (>80k).

list_of_tuples = [('1', 31.46), ('10', 97.99), ('50', 71.19), ('100', 17.03), ...]
normal_list = ['1', '50', '100', ...]

The desired output is:

list_of_tuples = [('1', 31.46), ('50', 71.19), ('100', 17.03), ...]

Here's the code that I threw together for testing the concept, but as I am new to Python, it doesn't work. I also haven't found a solution online for this kind of issue.

    for whole_elem in list_of_tuples:
        for first_elem in whole_elem:
            for link in normal_list:
                if first_elem <> link
                list_of_tuples.pop(whole_elem)

I would appreciate your support a lot. Thank you very much!

Chris
  • 23
  • 4
  • Would you consider using pandas, that would probably do that in a shorter time? – Roy2012 Jul 28 '20 at 11:42
  • @superbrain I made an adjustment, now it should be clearer – Chris Jul 28 '20 at 11:44
  • 1
    "it doesn't work" tells us nothing. – superb rain Jul 28 '20 at 11:46
  • Is `list_of_tuples` unique by ID? – juanpa.arrivillaga Jul 28 '20 at 11:46
  • A straightforward solution is `tuple(filter(lambda x: x[0] in normal_list, list_of_tuples))`. Whether or not that meets your needs (in terms of algorithm complexity) you will have to say. You can implement something quicker by expending more memory (if e.g. you could have the `normal_list` as a `dict` or by e.g. leveraging on sorting). – sim Jul 28 '20 at 11:47

3 Answers3

3

Does a list comprehension not work here?

filtered_tuples = [tpl for tpl in list_of_tuples if tpl[0] in normal_list]

Since your list is so big, you probably want to turn the normal_list into a set to improve performance:

normal_set = frozenset(normal_list)
filtered_tuples = [tpl for tpl in list_of_tuples if tpl[0] in normal_set]
Mark Reed
  • 91,912
  • 16
  • 138
  • 175
2
for first_elem in whole_elem:

The first element in whole_elem is whole_elem[0]. for first_elem in whole_elem would be a loop over all of its elements.

first_elem = whole_elem[0]
for link in normal_list:
    if first_elem <> link

This runs the body of the if every time there’s an element in normal_list that isn’t equal to first_elem. Instead, you want to check whether normal_list contains first_elem.

if first_elem not in normal_list:
dict_link_length.pop(whole_elem)

Not sure what dict_link_length is. Producing a new list is probably a better idea than modifying the old one, though.

filtered = []

for whole_elem in list_of_tuples:
    first_elem = whole_elem[0]

    if first_elem in normal_list:
        filtered.append(whole_elem)

Now pick more descriptive names for things (a list of tuples is a normal list). Here’s a start:

select_ids = ['1', '50', '100', ...]
filtered = []

for item in items:
    item_id = item[0]

    if item_id in select_ids:
        filtered.append(item)

Checking if an element is in a list with in is slow because it has to check the entire list one at a time, but checking if an element is in a set is constant-time:

select_ids = ['1', '50', '100', ...]
filtered = []

select_ids = frozenset(select_ids)

for item in items:
    item_id = item[0]

    if item_id in select_ids:
        filtered.append(item)

Finally, this can be written as a list comprehension:

select_ids = frozenset(select_ids)
filtered = [item for item in items if item[0] in select_ids]
Ry-
  • 218,210
  • 55
  • 464
  • 476
  • Thank you for your answer and sorry for confusing you with "dict_link_length", I forgot to change it to "list_of_tuples". – Chris Jul 28 '20 at 11:54
1

You can probably solve this conceptually at the same level as asked ("inner join") with pandas join function (look at this question).

However, here I would just do the following:

result = []
normal_set = set(normal_list)  # improves performance of 'contains' check from len(normal_list)/2 (on avarage) to log(len(normal_list)
for tpl in list_of_tuples:
   if tpl[0] in normal_set:
      result.append(tpl)

On top of that, you can yield elements of the result instead of append if you do not need to consume the result as a whole:

def inner_join(normal_list, list_of_tuples):
    result = []
    normal_set = set(normal_list)
    for tpl in list_of_tuples:
       if tpl[0] in normal_set:
          yield tpl

Which you use as:

for el in innner_join(normal_list, list_of_tuples):
    ....
sophros
  • 14,672
  • 11
  • 46
  • 75