
I'm coding a hash-table-like indexing mechanism that returns the number of the interval an integer falls into (0 to n, for n split points), according to a set of split points.

For example, if integers are split at value 3 (one split point, so two intervals), we can find the interval number for each array element using a simple comparison:

>>> import numpy as np
>>> x = np.array(range(7))
>>> [int(i>3) for i in x]
[0, 0, 0, 0, 1, 1, 1]

When there are many intervals, we can define a function as below:

>>> def get_interval_id(input_value, splits):
...     for i, split_point in enumerate(splits):
...         if input_value < split_point:
...             return i
...     return len(splits)
... 
>>> [get_interval_id(i, [2,4]) for i in x]
[0, 0, 1, 1, 2, 2, 2]

But this solution does not look elegant. Is there any Pythonic (better) way to do this job?

Prune
Jiang Xiang

3 Answers


Python itself does not have a dedicated built-in function for this process, which is called binning. If you wanted, you could compress your function into a one-line expression, but it's more readable this way.
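For sorted split points, the standard library's bisect module gets close to such a one-liner. It is a general binary-search helper rather than a binning function, so treat this as a sketch: the name interval_ids is made up for illustration, and the split points are assumed to be in ascending order.

```python
from bisect import bisect_right

def interval_ids(values, splits):
    # Hypothetical helper; assumes `splits` is sorted in ascending order.
    # bisect_right counts the split points that are <= v, which is the
    # interval index when a value equal to a split point belongs to the
    # upper interval (the same rule as the `<` test in the question).
    return [bisect_right(splits, v) for v in values]

print(interval_ids(range(7), [2, 4]))  # [0, 0, 1, 1, 2, 2, 2]
```

The lookup is a binary search, so each value costs O(log n) comparisons instead of a scan over all split points.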

However, data frame packages usually have full-featured binning methods; the most popular one in Python is pandas. It lets you collect or classify values by equal-width intervals, equal-frequency divisions (the same number of entries in each bin), or custom split values (your case). See this question for a good discussion and examples.

Of course, this means that you'd have to install and import pandas and, depending on the method, convert your list to a Series or data frame. If that's too much trouble, just keep your current implementation; it's readable, straightforward, and reasonably short.
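For the custom-split case, a minimal sketch with pd.cut might look like the following. The variable names are illustrative only, and padding the split points with infinities is my assumption about how to get open-ended outer intervals; as far as I know, pd.cut accepts a plain 1-D array-like, so no data frame is built here.

```python
import numpy as np
import pandas as pd

x = np.arange(7)
splits = [2, 4]

# pd.cut needs the outer edges as well, so pad the split points with
# -inf and +inf (an assumption made for this sketch).
edges = [-np.inf] + splits + [np.inf]

# right=False makes every bin closed on the left: [edge_i, edge_{i+1}).
# labels=False returns integer bin indices instead of Interval labels.
ids = pd.cut(x, bins=edges, right=False, labels=False)

print(np.asarray(ids).tolist())  # [0, 0, 1, 1, 2, 2, 2]
```

With right=False the boundary rule matches the `<` comparison in the question's function: a value equal to a split point lands in the upper interval.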

Prune

Since you're already using NumPy, I would suggest its digitize function:

>>> import numpy as np
>>> np.digitize(np.array([0, 1, 2, 3, 4, 5, 6]), [2, 4])
array([0, 0, 1, 1, 2, 2, 2])

From the documentation:

Return the indices of the bins to which each value in input array belongs.
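The boundary behaviour is worth checking against your own convention. As a sketch (the variable names are only illustrative): with the default right=False, a value equal to a split point falls into the upper interval, which matches the `<` test in the question's function; right=True puts it into the lower interval, which matches the `i > 3` comparison in the question's first example.

```python
import numpy as np

x = np.arange(7)
bins = [2, 4]

upper = np.digitize(x, bins)              # bins[i-1] <= v < bins[i]
lower = np.digitize(x, bins, right=True)  # bins[i-1] < v <= bins[i]

print(upper.tolist())  # [0, 0, 1, 1, 2, 2, 2]
print(lower.tolist())  # [0, 0, 0, 1, 1, 2, 2]

# For monotonically increasing bins, np.searchsorted gives the same indices.
assert (upper == np.searchsorted(bins, x, side="right")).all()
assert (lower == np.searchsorted(bins, x, side="left")).all()

# A single split at 3 with right=True reproduces [int(i > 3) for i in x].
print(np.digitize(x, [3], right=True).tolist())  # [0, 0, 0, 0, 1, 1, 1]
```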

Sebastian Wozny
  • No, they start from 0. If you want them to start differently, just insert bins appropriately. – Sebastian Wozny Nov 02 '17 at 20:22
  • Thank you. This is exactly what I was looking for. I am relatively comfortable with numpy, but did not know this function existed. Its name is not very intuitive, at least for me. Do you have any suggestions for how to better familiarize myself with numpy? – Jiang Xiang Nov 02 '17 at 20:25
  • If you want to practice your `pandas`/`numpy` skills, start doing data science (for example, Kaggle contests). If my answer solved your question, please accept it. – Sebastian Wozny Nov 02 '17 at 20:26

How about wrapping the whole process inside of one function instead of only half the process?

>>> get_interval_ids([0, 1, 2, 3, 4, 5, 6], [2, 4])
[0, 0, 1, 1, 2, 2, 2]

and your function would look like

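def get_interval_ids(values, splits):
    """Return the interval index of each value, given ascending split points."""

    def get_interval_id(input_value):
        # The first split point greater than the value marks its interval.
        for i, split_point in enumerate(splits):
            if input_value < split_point:
                return i
        # Not below any split point, so the value is in the last interval.
        return len(splits)

    return [get_interval_id(val) for val in values]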
noslenkwah
  • Why is it better to wrap it up? Please elaborate! – Jiang Xiang Nov 02 '17 at 20:31
  • @JiangXiang - Whether or not it's better depends on how you intend to use it. In general, a function should perform one complete task. If you want to get a list of interval_ids given a list of numbers, your function should do exactly that. Nothing more, nothing less. – noslenkwah Nov 02 '17 at 22:35
  • Thank you for your explanation! – Jiang Xiang Nov 03 '17 at 00:13