Optimisation for large dataset

Question

I posted some code for review here, but so far it has not received a useful response, which I assume is because the code is long. Let me cut to the chase. Suppose we have the following lists:

t0=[('Albania','Angola','Germany','UK'),('UK','France','Italy'),('Austria','Bahamas','Brazil','Chile'),('Germany','UK'),('US')]
t1=[('Angola', 'UK'), ('Germany', 'UK'), ('UK', 'France'), ('UK', 'Italy'), ('France', 'Italy'), ('Austria', 'Bahamas')]
t2=[('Angola:UK'), ('Germany:UK'), ('UK:France'), ('UK:Italy'), ('France:Italy'), ('Austria:Bahamas')]

The aim is, for each pair in t1, to go through t0 and, if the pair is found, replace it with the corresponding t2 element. We can do this with the following:

from pprint import pprint

result = []
for v1, v2 in zip(t1, t2):
    out = []
    for i in t0:
        common = set(v1).intersection(i)
        if set(v1) == common:
            out.append(tuple(list(set(i) - common) + [v2]))
        else:
            out.append(tuple(i))
    result.append(out)

pprint(result, width=100)

which gives:

[[('Albania', 'Germany', 'Angola:UK'),
  ('UK', 'France', 'Italy'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany', 'UK'),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany:UK'),
  ('UK', 'France', 'Italy'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany:UK',),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany', 'UK'),
  ('Italy', 'UK:France'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany', 'UK'),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany', 'UK'),
  ('France', 'UK:Italy'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany', 'UK'),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany', 'UK'),
  ('UK', 'France:Italy'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany', 'UK'),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany', 'UK'),
  ('UK', 'France', 'Italy'),
  ('Brazil', 'Chile', 'Austria:Bahamas'),
  ('Germany', 'UK'),
  ('U', 'S')]]

This list has length 6, matching the 6 elements in t1 and t2, and each sublist has 5 elements, corresponding to the number of elements in t0. On this sample the code is fast, but in my case t0 has a length of ~48000 and t1 a length of ~30000, and the run takes almost forever. How can the same operation be performed faster?

Your last tuple should be written like so: `('US',)`. The reason is that `('US') == 'US'` (both are strings) but `('US') != ('US',)`, as the latter is a tuple with one element in it. – Jonas Byström, Jul 11 '19 at 12:00
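The difference described in this comment can be checked directly. This is only a small illustration; the variable names are made up for the example.

```python
not_a_tuple = ('US')   # parentheses alone do not make a tuple: this is the string 'US'
one_tuple = ('US',)    # the trailing comma makes a one-element tuple

print(type(not_a_tuple).__name__)  # str
print(type(one_tuple).__name__)    # tuple

# Iterating over the string yields its characters, which is why the
# question's output shows ('U', 'S') for the last entry of t0.
print(tuple(not_a_tuple))  # ('U', 'S')
print(tuple(one_tuple))    # ('US',)
```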

vlemaistre · Accepted Answer · 2019-07-11T13:19:46.790

1

You could use a double list comprehension. The code runs approximately 3.47 times faster (13.3 µs vs 46.2 µs).

t0=[('Albania','Angola','Germany','UK'),('UK','France','Italy'),('Austria','Bahamas','Brazil','Chile'),('Germany','UK'),('US')]
t1=[('Angola', 'UK'), ('Germany', 'UK'), ('UK', 'France'), ('UK', 'Italy'), ('France', 'Italy'), ('Austria', 'Bahamas')]
t2=[('Angola:UK'), ('Germany:UK'), ('UK:France'), ('UK:Italy'), ('France:Italy'), ('Austria:Bahamas')]

# We transform the lists of tuples into lists of sets for easier and faster computations
t0 = [set(x) for x in t0]
t1 = [set(x) for x in t1]

# We define a function that copies a set, removes a collection of elements
# from the copy and adds one new element to it
def add_remove(set_, to_remove, to_add):
    result_temp = set_.copy()
    for element in to_remove:
        result_temp.remove(element)
    result_temp.add(to_add)
    return result_temp

# We do the computation using a double list comprehension
result = [[add_remove(y, x, z) if x.issubset(y) else y for y in t0] 
          for x, z in zip(t1, t2)]
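For the sizes mentioned in the question, the full output would hold 30000 × 48000 = 1.44 billion entries, most of them unchanged rows. A possible further step, not timed here and only a sketch, is to build an inverted index from each element to the rows of t0 that contain it, and then to store only the rows that actually change for each pair. The names `rows`, `index`, `changes` and `full_result` are hypothetical, the last tuple of t0 is written as `('US',)` as suggested in the comment under the question, and pairs are assumed to be non-empty.

```python
t0 = [('Albania', 'Angola', 'Germany', 'UK'), ('UK', 'France', 'Italy'),
      ('Austria', 'Bahamas', 'Brazil', 'Chile'), ('Germany', 'UK'), ('US',)]
t1 = [('Angola', 'UK'), ('Germany', 'UK'), ('UK', 'France'),
      ('UK', 'Italy'), ('France', 'Italy'), ('Austria', 'Bahamas')]
t2 = ['Angola:UK', 'Germany:UK', 'UK:France', 'UK:Italy', 'France:Italy', 'Austria:Bahamas']

rows = [set(x) for x in t0]

# Inverted index (hypothetical helper): element -> indices of the rows containing it
index = {}
for n, row in enumerate(rows):
    for element in row:
        index.setdefault(element, set()).add(n)

# For each pair, store only the rows that change: {row index: new row}
changes = []
for pair, label in zip(t1, t2):
    pair_set = set(pair)
    # Rows containing every element of the pair (pairs are assumed non-empty)
    hits = set.intersection(*(index.get(e, set()) for e in pair_set))
    changes.append({n: (rows[n] - pair_set) | {label} for n in sorted(hits)})

def full_result(k):
    """Rebuild the complete list for pair number k, only when it is needed."""
    return [changes[k].get(n, row) for n, row in enumerate(rows)]

print({n: sorted(row) for n, row in changes[1].items()})
# {0: ['Albania', 'Angola', 'Germany:UK'], 3: ['Germany:UK']}
```

Whether this helps depends on the data: it avoids testing every pair against every row, but it only pays off if each pair occurs in a small share of the rows.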

edited Jul 11 '19 at 13:19

answered Jul 11 '19 at 11:56


Thanks, but there is an issue: take `result[:1]` and you get `[[{'Albania', 'Angola:UK', 'Germany'}, {'Italy', 'UK:France'}, {'Austria:Bahamas', 'Brazil', 'Chile'}, {'Germany:UK'}, {'S', 'U'}]]`, which is wrong. Each time only one pair should be replaced, not all possible pairs. Look at the output in my question. – Wiliam Jul 11 '19 at 12:05
Since we were working with sets there is no order. That is why the output wasn't in the right order. I changed it to lists to have your expected outcome, but it will be a bit slower – vlemaistre Jul 11 '19 at 12:20
So now when I copy your answer it throws an error: `AttributeError: 'list' object has no attribute 'copy'` – Wiliam Jul 11 '19 at 12:25
`list.copy()` was [implemented](https://stackoverflow.com/questions/2612802/how-to-clone-or-copy-a-list) in Python 3.3. If you have an older version, you should replace the line `result_temp = list_.copy()` with `result_temp = list(list_)` – vlemaistre Jul 11 '19 at 12:29
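The list-based variant discussed in these comments is not shown in the current answer. A minimal sketch of what it might look like follows, assuming the lists from the question (with the last tuple written as `('US',)`) and using `list(list_)` so that it also runs on versions before Python 3.3. The name `add_remove_list` is hypothetical.

```python
t0 = [('Albania', 'Angola', 'Germany', 'UK'), ('UK', 'France', 'Italy'),
      ('Austria', 'Bahamas', 'Brazil', 'Chile'), ('Germany', 'UK'), ('US',)]
t1 = [('Angola', 'UK'), ('Germany', 'UK'), ('UK', 'France'),
      ('UK', 'Italy'), ('France', 'Italy'), ('Austria', 'Bahamas')]
t2 = ['Angola:UK', 'Germany:UK', 'UK:France', 'UK:Italy', 'France:Italy', 'Austria:Bahamas']

def add_remove_list(list_, to_remove, to_add):
    """Copy list_, drop the elements in to_remove, then append to_add."""
    result_temp = list(list_)        # works on any Python version, unlike list_.copy()
    for element in to_remove:
        result_temp.remove(element)  # removes the first occurrence, keeps the order
    result_temp.append(to_add)
    return result_temp

# Lists keep the original element order, at the cost of slower membership tests
result = [[add_remove_list(y, x, z) if set(x).issubset(y) else list(y) for y in t0]
          for x, z in zip(t1, t2)]

print(result[0][0])  # ['Albania', 'Germany', 'Angola:UK']
```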
Thank you. I'm going to wait a little longer for more participation, and if no faster method turns up I'll accept yours. It is indeed much faster than what I had, but it will still take around 250 minutes to go through my original data, and I would need to rerun it, so I think I still need to bring the running time down further. – Wiliam Jul 11 '19 at 13:00
@William no problem about the wait. Anyway, I think I managed to make the sets work, and it's three times faster than my other method. Feel free to check my edited answer – vlemaistre Jul 11 '19 at 13:19
I checked it on a sample of my dataset, and it was indeed faster by a few seconds. Thanks. – Wiliam Jul 11 '19 at 13:34
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/196368/discussion-between-vlemaistre-and-william). – vlemaistre Jul 12 '19 at 11:25
