
I have a huge list with more than 1 million entries.

My bins are [0, 1, 2, 3, ..., 1000].

For bin 0, all of the >1m entries pass (every value is greater than or equal to 0), and so on for the other bins.

I need a fast solution. I tried coding it myself, but it is quite slow.

Any help is appreciated. Thanks.

Input:
input_list = [0,0,0,1,2,3,55,34,......] (almost 1 million entries)
bins = [0,1,2,....., 1000]

Output:
{0:1.00, 1:0.99, 2:0.98, ..., 1000:0.02}
where the key is the bin and
      the value is the ratio of the number of values greater than or equal to that bin to the total number of entries in the list.


vinay kusuma

2 Answers


A very simple approach: for each bin, count the number of elements greater than or equal to that bin and divide by the total number of records.

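import numpy as np

# Dummy data: 1 million random integers in [0, 2000)
data = np.random.randint(2000, size=10**6)
bins = np.arange(1001)  # bins 0, 1, ..., 1000
dic = {}
for bi in bins:
    # Fraction of entries greater than or equal to this bin
    dic[bi] = np.count_nonzero(data >= bi) / len(data)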
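The loop above scans the whole array once per bin. If that turns out to be too slow, one possible alternative (not part of the original answer) is to sort the data once and use np.searchsorted to find, for every bin at the same time, how many values lie below it. This is only a minimal sketch; the function name ratio_at_least is made up for illustration, and it assumes the input is non-empty.

```python
import numpy as np

def ratio_at_least(values, bins):
    """Return {bin: fraction of values >= bin} using one sort and a binary search.

    Hypothetical helper for illustration; assumes `values` is non-empty.
    """
    values = np.sort(np.asarray(values))
    bins = np.asarray(bins)
    # side="left" gives, for each bin, the number of values strictly below it
    below = np.searchsorted(values, bins, side="left")
    ratios = (values.size - below) / values.size
    return {int(b): float(r) for b, r in zip(bins, ratios)}

# Small example using the first few values from the question
result = ratio_at_least([0, 0, 0, 1, 2, 3, 55, 34], range(0, 4))
print(result)
```

For the real data you would call it as ratio_at_least(input_list, range(0, 1001)). The cost is one sort of the data plus one binary search per bin, instead of one full pass over the data per bin.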
Equinox

If I understand your question correctly, you can use numpy.histogram. The following chunk of code should do if you substitute in your own input_list and bins:

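import numpy as np

# Filling in dummy data: 100 random integers in [0, 100)
input_list = np.random.randint(low=0, high=100, size=100)

# Set up bin edges as [0, 1, 2, ... 100]; they must cover the whole data range
bins = np.arange(0, 101)

# Run numpy.histogram: hist[i] counts the values in [bins[i], bins[i+1])
hist, bin_edges = np.histogram(input_list, bins=bins)

# Find cumulative sum: number of values strictly below each bin's left edge
cumsum = np.concatenate(([0], np.cumsum(hist)[:-1]))

# Find ratios: ratios[i] is the fraction of values >= bin_edges[i]
ratios = (len(input_list) - cumsum) / len(input_list)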

The ratios variable contains what you're looking for, i.e. the ratio of values greater than or equal to each bin.
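Since the data and bins in the question are non-negative integers, the same cumulative-count idea can probably also be written with np.bincount and a reversed cumulative sum, which yields one ratio for every bin from 0 up to the largest bin. This is a rough sketch under that assumption, not a drop-in replacement; the function name ratios_from_counts and the max_bin parameter are invented for illustration.

```python
import numpy as np

def ratios_from_counts(values, max_bin):
    """Return an array r where r[b] is the fraction of values >= b, for b = 0..max_bin.

    Hypothetical helper; assumes `values` holds non-negative integers and is non-empty.
    """
    values = np.asarray(values)
    # Clip so that everything above max_bin is counted in the last slot
    clipped = np.minimum(values, max_bin)
    counts = np.bincount(clipped, minlength=max_bin + 1)
    # Reversed cumulative sum: at_least[b] = number of values >= b
    at_least = counts[::-1].cumsum()[::-1]
    return at_least / values.size

# Small example using the first few values from the question
ratios_small = ratios_from_counts([0, 0, 0, 1, 2, 3, 55, 34], max_bin=3)
print(dict(enumerate(ratios_small.tolist())))
```

For the question's setup this would be called with max_bin=1000, and dict(enumerate(...)) turns the result into the {bin: ratio} mapping that was asked for.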

tnwei