
Hi, I have a JSON file with 1200 city names and their populations.

I'm combining those cities in groups of 3 using itertools.combinations:

import json
from itertools import combinations as com

def main():
    with open('new.json') as f:
        data = json.load(f)
    triples = list(com(data, 3))  # materializing every combination is the slow part

main()

This is just test code, but it takes forever to run. I know it's a lot of combinations, and I don't actually need that many: later I will choose only 10,000 combinations from the overall results, based on a linear probability where the bigger the sum of the 3 cities' populations, the less likely that combination is to be included in those 10,000.

I am looking for a way to optimize the combination phase. I was thinking about skipping some combinations, for example skipping every 4th combination, but I can't find a way to do that.

Eugene
  • There are 287,280,400 combinations of three out of 1200. Making a list of those tuples takes an absurd amount of memory, so you shouldn't do that. That's why itertools.combinations is a generator, after all. You'll need to reduce that number by quite a lot to make it manageable; I doubt whether eliminating a quarter of the combinations will help much. – rici Jan 07 '22 at 03:41
  • Try `random.sample` instead. – o11c Jan 07 '22 at 03:44
  • @o11c `random.sample` will not work on a generator (the itertools.combinations). Please read here - https://stackoverflow.com/questions/12581437/python-random-sample-with-a-generator-iterable-iterator. – Akshay Sehgal Jan 07 '22 at 03:46
  • Yes, but you can avoid itertools *entirely* and just create samples in a loop instead. – o11c Jan 07 '22 at 03:50
  • but that would still need the complete list of combinations of the 1200 cities taken 3 at a time – Akshay Sehgal Jan 07 '22 at 03:52
  • @AkshaySehgal Okay, I've gone ahead and proven you don't need a generator at all. The only materialized list is the one "loaded" from the JSON, and *possibly* the duplicate check if there's a need to add that. – o11c Jan 07 '22 at 04:21
  • It's not exactly combinations, but I guess this would work if OP is looking to get just samples as well. Upvoted. – Akshay Sehgal Jan 07 '22 at 04:23

2 Answers


You don't need itertools at all. Just manually create some random combinations using `random.sample`.

Theoretically this can return the same sample more than once, though that is quite unlikely. If this is a concern, you can use a set to track tuples you've seen, as sketched after the example below.

#!/usr/bin/env python3

import random

def load_cities():
    # hard-code some data from Wikipedia instead of loading from JSON
    # to simplify the demonstration
    return [
        ("Tōkyō", 37400068),
        ("Delhi", 28514000),
        ("Shanghai", 25582000),
        ("São Paulo", 21650000),
        ("Mexico City", 21581000),
        ("Cairo", 20076000),
        ("Mumbai", 19980000),
        ("Beijing", 19618000),
        ("Dhaka", 19578000),
        ("Osaka", 19281000),
        ("New York", 18819000),
        ("Karachi", 15400000),
        ("Buenos Aires", 14967000),
        ("Chongqing", 14838000),
        ("Istanbul", 14751000),
        ("Kolkata", 14681000),
        ("Manila", 13482000),
        ("Lagos", 13463000),
        ("Rio de Janeiro", 13293000),
        ("Tianjin", 13215000),
        ("Kinshasa", 13171000),
        ("Guangzhou", 12638000),
        ("Los Angeles", 12458000),
        ("Moscow", 12410000),
        ("Shenzhen", 11908000),
        ("Lahore", 11738000),
        ("Bangalore", 11440000),
        ("Paris", 10901000),
        ("Bogotá", 10574000),
        ("Jakarta", 10517000),
        ("Chennai", 10456000),
        ("Lima", 10391000),
        ("Bangkok", 10156000),
        ("Seoul", 9963000),
        ("Nagoya", 9507000),
        ("Hyderabad", 9482000),
        ("London", 9046000),
        ("Tehran", 8896000),
        ("Chicago", 8864000),
        ("Chengdu", 8813000),
        ("Nanjing", 8245000),
        ("Wuhan", 8176000),
        ("Ho Chi Minh City", 8145000),
        ("Luanda", 7774000),
        ("Ahmedabad", 7681000),
        ("Kuala Lumpur", 7564000),
        ("Xi'an", 7444000),
        ("Hong Kong", 7429000),
        ("Dongguan", 7360000),
        ("Hangzhou", 7236000),
        ("Foshan", 7236000),
        ("Shenyang", 6921000),
        ("Riyadh", 6907000),
        ("Baghdad", 6812000),
        ("Santiago", 6680000),
        ("Surat", 6564000),
        ("Madrid", 6497000),
        ("Suzhou", 6339000),
        ("Pune", 6276000),
        ("Harbin", 6115000),
        ("Houston", 6115000),
        ("Dallas", 6099000),
        ("Toronto", 6082000),
        ("Dar es Salaam", 6048000),
        ("Miami", 6036000),
        ("Belo Horizonte", 5972000),
        ("Singapore", 5792000),
        ("Philadelphia", 5695000),
        ("Atlanta", 5572000),
        ("Fukuoka", 5551000),
        ("Khartoum", 5534000),
        ("Barcelona", 5494000),
        ("Johannesburg", 5486000),
        ("Saint Petersburg", 5383000),
        ("Qingdao", 5381000),
        ("Dalian", 5300000),
        ("Washington, D.C.", 5207000),
        ("Yangon", 5157000),
        ("Alexandria", 5086000),
        ("Jinan", 5052000),
        ("Guadalajara", 5023000),
    ]

def main():
    all_cities = load_cities()
    for _ in range(10):
        some_cities = random.sample(all_cities, 3)
        total_pop = 0
        names = []
        for city, pop in some_cities:
            names.append(city)
            total_pop += pop
        names[-1] = 'and ' + names[-1]
        names = ', '.join(names)
        print('The total population of', names, 'is', total_pop)

if __name__ == '__main__':
    main()
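
If duplicates do matter (see the comments below), here is a minimal sketch of the set-based check mentioned above; `sample_unique_triples` is a hypothetical helper name:

import random

def sample_unique_triples(cities, k):
    # Sort the sampled indices so that the same three cities drawn in a
    # different order count as a duplicate.
    seen = set()
    results = []
    while len(results) < k:
        idx = tuple(sorted(random.sample(range(len(cities)), 3)))
        if idx not in seen:
            seen.add(idx)
            results.append([cities[i] for i in idx])
    return results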
o11c
  • "*Theoretically this can return the same sample more than once, though that is quite unlikely.*" "Quite unlikely" is not very precise but I think you're underestimating the probability of collision. If you generate 10,000 random 3-combinations of 1,200 things, the probability that they are all distinct is about 84%. In almost one of six experiments, you'll find at least on duplicate. (That's the config in the OP.) At 20,000 random 3-combinations, the probability of a duplicate is about 50%. Welcome to the birthday paradox. (Sqrt(1200C3) is about 17,000.) So dupe-checking is probably needed. – rici Jan 08 '22 at 23:06
  • Sure, they might happen *somewhere* in the subset of data. But they will be rare enough that it won't affect basically anything you can actually *do* with the data. – o11c Jan 08 '22 at 23:12
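
For reference, the 84% figure above follows from the standard birthday-problem approximation: the probability that k random draws from N equally likely values are all distinct is roughly exp(-k(k-1)/2N). A quick check:

import math

n = math.comb(1200, 3)                    # 287,280,400 possible triples
k = 10_000                                # draws, as in the question
print(math.exp(-k * (k - 1) / (2 * n)))   # ~0.84 probability of all-distinct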

Try building a generator expression with the condition you need.

  1. Here I take 12,000 random IDs.
  2. Then I create a generator over the combinations of these IDs and keep only the ones whose index is a multiple of 4 (you can change that with more conditions, such as `i % 4 < 3`, etc.).
  3. Finally, I loop 10 times (change to 10000, etc.) to get the first 10 matching combinations, without making it run infinitely.
from itertools import combinations
import numpy as np

ids = np.random.randint(0, 500000, (12000,))

generator = (j for i, j in enumerate(combinations(ids, 3)) if i % 4 == 0)

for i in range(10):
    print(next(generator))
(105466, 481866, 68347)
(105466, 481866, 267588)
(105466, 481866, 298501)
(105466, 481866, 318543)
(105466, 481866, 479957)
(105466, 481866, 327264)
(105466, 481866, 437524)
(105466, 481866, 61672)
(105466, 481866, 390489)
(105466, 481866, 90756)
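
Note that `itertools.islice` expresses the same every-4th stride more directly than the enumerate/modulo filter, though it still iterates the skipped combinations internally; a minimal sketch under the same setup:

from itertools import combinations, islice
import numpy as np

ids = np.random.randint(0, 500000, (12000,))

# Keep indices 0, 4, 8, ... of the combinations stream.
generator = islice(combinations(ids, 3), 0, None, 4)

for _ in range(10):
    print(next(generator))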

... the bigger the sum of 3 cities' population the less possible this combination will be included into those 10000 combinations.

  1. This might be difficult without an analytical approach. There is no way to estimate the population sum of every combination of cities without actually generating the combinations.

  2. This means you may benefit from simply analyzing the distribution of populations for a single list of cities.

import matplotlib.pyplot as plt
import numpy as np
from itertools import combinations

a = np.random.normal(size=(100,))

fig, axes = plt.subplots(1,2, figsize=(10,5))

axes[0].hist(a)
axes[1].hist([sum(i) for i in combinations(a,3)])

## NOTICE THE DISTRIBUTION OF THE ORIGINAL LIST AND COMBINATIONS

(figure: histograms of the original list, left, and of the sums over all 3-combinations, right)

  1. This will allow you to estimate the distribution of the sum of populations and set that condition in the generator expression itself. You can estimate a cutoff for the population sum over which you want to sample.

  2. Another way is to take a large enough sample from the generator to estimate the population distribution from the sample distribution, then use that as a condition to fetch enough samples for your downstream task, as sketched below.
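
Here is a minimal sketch of that sample-then-cutoff step; the 100,000-combination sample size and the 90th-percentile cutoff are illustrative assumptions, and since `ids` is already random here, a prefix of the generator is a workable sample:

from itertools import combinations, islice
import numpy as np

ids = np.random.randint(0, 500000, (12000,))

# Estimate the distribution of 3-way sums from a prefix of the generator.
sample_sums = [sum(c) for c in islice(combinations(ids, 3), 100000)]

# A 90th-percentile cutoff keeps roughly the top 10% of sums; flip the
# inequality in the generator below to favor small sums instead.
cap = np.percentile(sample_sums, 90)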

Here is how the code would look in general:

cap = 1000000

generator = (j for i, j in enumerate(combinations(ids, 3)) if i % 4 == 0 and sum(j) > cap)

for i in range(10):
    comb = next(generator)
    print(comb, '->', sum(comb))
(105466, 481866, 479957) -> 1067289
(105466, 481866, 437524) -> 1024856
(105466, 481866, 481925) -> 1069257
(105466, 481866, 427610) -> 1014942
(105466, 481866, 490693) -> 1078025
(105466, 481866, 435187) -> 1022519
(105466, 481866, 499844) -> 1087176
(105466, 481866, 422053) -> 1009385
(105466, 481866, 436762) -> 1024094
(105466, 481866, 490919) -> 1078251
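
Alternatively, if you want the question's "bigger sum, less likely" behavior directly rather than a hard cutoff, here is a rejection-sampling sketch; the linear acceptance weight is an assumption based on the question's wording:

import random
import numpy as np

ids = list(np.random.randint(0, 500000, (12000,)))
lo, hi = 3 * min(ids), 3 * max(ids)   # bounds on any 3-way sum

samples = []
while len(samples) < 10:
    triple = random.sample(ids, 3)
    s = sum(triple)
    # Acceptance probability falls linearly from 1 (smallest possible
    # sum) to 0 (largest possible sum).
    if random.random() < (hi - s) / (hi - lo):
        samples.append(tuple(triple))

for t in samples:
    print(t, '->', sum(t))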
Akshay Sehgal
  • The obvious problem with this is that the first numbers will never change. – o11c Jan 07 '22 at 03:51
  • What do you mean? OP needs a sample of combinations. I am simply explaining how to add conditions to fetch the samples, rather than iterating over them completely, which would take an impossible amount of time and memory. – Akshay Sehgal Jan 07 '22 at 03:58
  • If you mean that OP will have to take the samples sequentially, rather than getting them in a shuffled order, that is not possible without materializing the generator into a list. Please do read https://stackoverflow.com/questions/21187131/how-to-use-random-shuffle-on-a-generator-python. Let me know if that is what you meant. – Akshay Sehgal Jan 07 '22 at 04:01
  • @o11c Btw, on a side note, I love your website :) – Akshay Sehgal Jan 07 '22 at 04:04