Grouping / clustering numbers in Python with relative relations between the numbers

Question

I need something similar than asked here, but more the other way round:

Given a list of numbers (example from the question above): [1, 6, 9, 100, 102, 105, 109, 134, 139] I do not want to say the max gap (in that example 10) but how many groups I would like.

e.g. 3 groups: 1, 6, 9 / 100, 102, 105, 109 / 134, 139 2 groups: 1, 6, 9 / 100, 102, 105, 109, 134, 139 ...

this should work relative, as my numbers are very, very different: [0.1, 0.2, 1, 4, 100, 110] => 3 groups should lead to 0.1, 0.2 / 1, 4 / 100, 110

although 0.2 and 1 are nearer than 1 and 5 in absolute terms (0.8 vs. 3) in relation 0.2 is further away from 1 (5x) than to 0.1 (2x).

I hope it gets clear somehow what I would like to achieve...

what did you try already? it is always better to have a starting point. — Netwave, Apr 25 '19 at 08:25
1. find largest gap 2. divide group with largest gap at largest gap 3. when not enough groups repeat — Lee, Apr 25 '19 at 08:26
How about some simple clustering algorithm? Have you tried any? — yatu, Apr 25 '19 at 08:28
If your criterion for grouping is the ratio of successive numbers, your first data sample would be split in 3 groups like this: `[[1], [6, 9], [100, 102, 105, 109, 134, 139]]`. If you want the result you gave in your question, you have to use a different criterion - please make it explicit for us! — Thierry Lathuille, Apr 25 '19 at 08:50

FMc · Answer 1 · 2019-04-25T14:29:56.437

import sys

# Sorted numbers.
xs = [0.1, 0.2, 1, 4, 100, 110]
xs.sort()

# Reverse-sorted (RATIO, INDEX) tuples.
tups = sorted(
    (
        (xs[i] / xs[i - 1] if i else 0, i)
        for i, x in enumerate(xs)
    ),
    reverse = True,
)

# Indexes of the boundaries having the largest ratios.
n_groups = int(sys.argv[1])
boundaries = sorted(tup[1] for tup in tups[0 : n_groups - 1])
boundaries.append(None)

# Create the groups, based on those boundaries.
groups = []
i = 0
for j in boundaries:
    groups.append(xs[i:j])
    i = j

# Check.
for g in groups:
    print(g)

Example output:

# n_groups = 1
[0.1, 0.2, 1, 4, 100, 110]

# n_groups = 2
[0.1, 0.2, 1, 4]
[100, 110]

# n_groups = 3
[0.1, 0.2]
[1, 4]
[100, 110]

# n_groups = 4
[0.1, 0.2]
[1]
[4]
[100, 110]

# n_groups = 5
[0.1]
[0.2]
[1]
[4]
[100, 110]

Jundiaius · Answer 2 · 2019-04-25T09:08:41.507

If the list of numbers is not huge, I would first calculate every ratio between a number and its precedent and then choose the largest splits.

For example:

def split_numbers_list(numbers_list, n_groups):
    # If the number of groups is 1 or less, the whole list the only group
    if n_groups <= 1:
        return [numbers_list]
    n_splits = min(len(numbers_list), n_groups) # Can't have more groups than elements
    # Now we calculate the ratios and store them in (index, ratio) tuples
    ratios = [
        (i, numbers_list[i + 1] / numbers_list[i]) for i in range(len(numbers_list) - 1)
    ]
    sorted_ratios = sorted(ratios, key=lambda r: r[1], reverse=True)
    # `chosen_splits` stores the boundaries of each group
    chosen_splits = (
        [0]
        + sorted([r[0] + 1 for r in sorted_ratios[: n_splits - 1]])
        + [len(numbers_list)]
    )
    return [
        numbers_list[chosen_splits[i] : chosen_splits[i + 1]]
        for i in range(len(chosen_splits) - 1)
    ]

But attention to what this supposes about numbers_list: it has to be a sorted list of strictly positive numbers, otherwise this function will not work.

Some examples:

>>> my_list = [1, 2, 3, 4, 5, 100, 101, 103, 504, 2301]

>>> split_numbers_list(my_list, 5)
[[1], [2, 3, 4, 5], [100, 101, 103], [504], [2301]]

>>> split_numbers_list(my_list, 3)
[[1, 2, 3, 4, 5], [100, 101, 103], [504, 2301]]

Grouping / clustering numbers in Python with relative relations between the numbers

2 Answers2