4

Consider a population with skewed class distribution as in

     ErrorType   Samples
        1          XXXXXXXXXXXXXXX
        2          XXXXXXXX
        3          XX
        4          XXX
        5          XXXXXXXXXXXX

I would like to randomly sample 20 out of the 40 without undersampling any of the classes with smaller participation. For example, in the above case, I would want to sample as follows:

     ErrorType   Samples
        1          XXXXX|XXXXXXXXXX
        2          XXXXX|XXX
        3          XX***|
        4          XXX**|
        5          XXXXX|XXXXXXX

i.e. 5 each of Types 1, 2 and 5, 2 of Type 3 and 3 of Type 4

  1. This guarantees I have a sample of size as close as possible to my target, i.e. 20 samples (see the quick check after this list).
  2. None of the classes is under-represented, especially classes 3 and 4.
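
A quick check of the arithmetic: with per-class populations of 15, 8, 2, 3 and 12 and a cap of 5 per class, each class contributes min(count, 5):

counts = [15, 8, 2, 3, 12]
cap = 5
quotas = [min(c, cap) for c in counts]
# quotas -> [5, 5, 2, 3, 5], which sums to exactly 20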

I ended up writing rather circuitous code, but I believe there must be an easier way using pandas methods or some sklearn function.

 sample_size = 20 # Just for the example
 # Determine the average participation per error type
 avg_items = sample_size / len(df.ErrorType.unique())
 value_counts = df.ErrorType.value_counts()
 # Classes with fewer samples than the average keep everything they have
 less_than_avg = value_counts[value_counts < avg_items]
 # Spread the quota those classes leave unused over the remaining classes
 offset = avg_items * len(less_than_avg) - sum(less_than_avg)
 offset_per_item = offset / (len(value_counts) - len(less_than_avg))
 adj_avg = int(avg_items + offset_per_item)
 df = df.groupby(['ErrorType'],
                 group_keys=False).apply(lambda g: g.sample(min(adj_avg, len(g))))
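
To make the snippet self-contained, the illustrative population above can be built first (a sketch for demonstration; the actual data is a much larger dataframe):

import numpy as np
import pandas as pd

# Illustrative population: 15, 8, 2, 3 and 12 samples of error types 1-5
df = pd.DataFrame({'ErrorType': np.repeat([1, 2, 3, 4, 5], [15, 8, 2, 3, 12])})

On this frame the snippet draws 5, 5, 2, 3 and 5 samples per type, i.e. exactly 20.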
Abhijit
  • So the data provided is what you actually have or is it to illustrate the problem? – Bharath M Shetty Nov 30 '17 at 14:07
  • @Bharath: For illustration purpose. – Abhijit Nov 30 '17 at 14:08
  • Out of curiosity, can you show us a sample of what the actual data looks like? All that comes to mind looking at the data is regex replacement, but this has nothing to do with strings, right? – Bharath M Shetty Nov 30 '17 at 14:09
  • The data is in the form of a `pandas.DataFrame` with 100s of columns and millions of rows of varied datatypes (string, int, float - ordinal, cardinal). The class I am using to stratify is a Category Code, with 15 category codes for now, but it would grow. My use case is ML and not some text processing. Please refer to the sample code I included with the question. – Abhijit Nov 30 '17 at 14:12
  • Now that's interesting. You want a sample from each row with at most 5 samples, right? And if a row has fewer than 5, all of them should be present? – Bharath M Shetty Nov 30 '17 at 14:14
  • @Bharath: Not exactly. For the given example, I need to randomly sample 20 items, without undersampling any of the classes. For example, in the final sample, I would still want to see as many samples of types 3 and 4 as it is in the original population. – Abhijit Nov 30 '17 at 14:22

3 Answers

2

You can make use of a helper column that caps how many samples can be drawn from each row, and then use `pd.Series.sample`, i.e.

Example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ErrorType':[1,2,3,4,5],
                   'Samples':[np.arange(100),np.arange(10),np.arange(3),np.arange(2),np.arange(100)]})

df['new'] = df['Samples'].str.len().where(df['Samples'].str.len() < 5, 5)
# this lets us know how many samples can be extracted per row
#0    5
#1    5
#2    3
#3    2
#4    5
#Name: new, dtype: int64

# Sampling based on the newly obtained column, i.e.
df.apply(lambda x : pd.Series(x['Samples']).sample(x['new']).tolist(),1)

0    [52, 81, 43, 60, 46]
1         [8, 7, 0, 9, 1]
2               [2, 1, 0]
3                  [1, 0]
4    [29, 24, 16, 15, 69]
dtype: object

I wrote a function to return the per-row sample sizes based on a threshold, i.e.

def get_thres_arr(sample_size, sample_length):
    # Start every row at the smallest row length
    thresh = sample_length.min()
    size = np.array([thresh] * len(sample_length))
    sum_of_size = sum(size)
    while sum_of_size < sample_size:
        # If the length is more than the threshold then raise the cap by 1
        size = np.where(sample_length > thresh, thresh + 1, sample_length)
        sum_of_size = sum(size)
        # increment threshold
        thresh += 1
    return size

df = pd.DataFrame({'ErrorType':[1,2,3,4,5,1,7,9,4,5],
                   'Samples':[np.arange(100),np.arange(10),np.arange(3),np.arange(2),np.arange(100),np.arange(100),np.arange(10),np.arange(3),np.arange(2),np.arange(100)]})
ndf = pd.DataFrame({'ErrorType':[1,2,3,4,5,6],
                   'Samples':[np.arange(100),np.arange(10),np.arange(3),np.arange(1),np.arange(2),np.arange(100)]})


get_thres_arr(20,ndf['Samples'].str.len())
#array([5, 5, 3, 1, 2, 5]) -> sums to 21, the closest the cap can get to 20

get_thres_arr(20,df['Samples'].str.len())
#array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]) -> sums to exactly 20

Now that you have the sizes, you can use:

df['new'] = get_thres_arr(20,df['Samples'].str.len())
df.apply(lambda x : pd.Series(x['Samples']).sample(x['new']).tolist(),1)

0    [64, 89]
1      [4, 0]
2      [0, 1]
3      [1, 0]
4    [41, 80]
5    [25, 84]
6      [4, 0]
7      [2, 0]
8      [1, 0]
9     [34, 1]

Hope it helps.
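
Since the OP's real data has one row per sample rather than list-valued cells, here is a sketch of how `get_thres_arr` could plug into that layout (the dataframe below is illustrative):

import numpy as np
import pandas as pd

# One row per sample, as in the question
df = pd.DataFrame({'ErrorType': np.repeat([1, 2, 3, 4, 5], [15, 8, 2, 3, 12])})

counts = df['ErrorType'].value_counts().sort_index()
sizes = get_thres_arr(20, counts)          # -> array([5, 5, 2, 3, 5])
per_class = dict(zip(counts.index, sizes))

# Draw each class's quota and reassemble the stratified sample
sampled = (df.groupby('ErrorType', group_keys=False)
             .apply(lambda g: g.sample(per_class[g.name])))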

Bharath M Shetty
  • But how did you calculate the limit to be 5? In my case, it was 5 because that guarantees I would pick up 20 items even if some classes have items less than 5. – Abhijit Nov 30 '17 at 14:24
  • You have hardcoded 5 in your code without showing how you arrived at that magic number. – Abhijit Nov 30 '17 at 14:26
  • Your code is a variation of mine, except that you have not shown how to calculate `adj_avg ` – Abhijit Nov 30 '17 at 14:28
  • @Abhijit so you want to find that magical number such that the total sample size, after taking samples from each row, would be 20, am I right? – Bharath M Shetty Nov 30 '17 at 14:30
  • Yes and no. If there is some function which can do the sampling without determining the sample size per class, it should also work. – Abhijit Nov 30 '17 at 14:32
  • Yeah I meant that only. To make the sample size 20 by creating a threshold. – Bharath M Shetty Nov 30 '17 at 14:33
  • @Abhijit to clarify something: instead of 2 samples, what if there is only 1 sample in row 3? What should the threshold be? – Bharath M Shetty Nov 30 '17 at 15:06
  • Ideally it should be 5 either of types -1, -2 or -4, 4 either of the remaining types of -1, -2 or -5 and 2 of Type -3 and 3 of Type -4. But a close approximation wherein 5 of Type -1 and -2 and -3, 2 of Type -3 and 3 of Type -4 should do fine. This is what my current code does. That is why I mentioned "This guarantees I have sample of **size as close to my target** i.e. 20 samples" – Abhijit Nov 30 '17 at 16:34
  • To find the parameters you would have to give more examples, so we could use `numpy.linalg.solve`; this one is really hard. – Bharath M Shetty Nov 30 '17 at 16:38
  • @Abhijit I managed to write a code for that, check if it helps. – Bharath M Shetty Nov 30 '17 at 17:22
1

Wow, got nerd-sniped on this one. I've written a function that will do what you want in numpy, without any magic numbers. It's not pretty, but I couldn't waste all that time writing something and not post it as an answer. There are two outputs, `n_for_each_label` and `random_idxs`, which are the number of selections to make for each class and the randomly selected data respectively. I can't think why you would want `n_for_each_label` when you have `random_idxs`, though.

EDIT: As far as I'm aware there is no functionality to do this in scikit-learn; it's not a very common way to dice up your data for ML, so I doubt there is anything.
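
For what it's worth, scikit-learn's stratified utilities solve the opposite problem: they preserve class proportions, which keeps the rare classes rare. A sketch of the contrast, reusing the y and sample_size constructed below:

from sklearn.model_selection import train_test_split

# Proportional stratified sampling: rare classes stay under-represented
_, prop_sample = train_test_split(y, test_size=sample_size, stratify=y,
                                  random_state=0)
# np.bincount(prop_sample)[1:] -> roughly [8, 4, 1, 1, 6]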

import numpy as np

# This is your input, sample size and your labels
sample_size = 20
# in your case you'd just want y = df.ErrorType
y = np.hstack((np.ones(15), np.ones(8)*2,
               np.ones(2)*3, np.ones(3)*4,
               np.ones(12)*5))
y = y.astype(int)
# y = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
#      3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

# Below is the function
unique_labels = np.unique(y)
bin_c = np.bincount(y)[unique_labels]
label_mat = np.ones((bin_c.shape[0], bin_c.max()), dtype=int)*-1
for i in range(unique_labels.shape[0]):
    label_loc = np.where(y == unique_labels[i])[0]
    np.random.shuffle(label_loc)
    label_mat[i, :label_loc.shape[0]] = label_loc
random_size = 0
i = 0
# widen the per-class window one column at a time until it covers
# at least sample_size entries
while random_size < sample_size:
    i += 1
    random_size = np.sum(label_mat[:, :i] != -1)

if random_size == sample_size:
    random_idxs = label_mat[:, :i]
    n_for_each_label = np.sum(random_idxs != -1, axis=1)
    random_idxs = random_idxs[random_idxs != -1]
else:
    random_idxs = label_mat[:, :i]
    last_idx = np.where(random_idxs[:, -1] != -1)[0]
    n_drop = random_size - sample_size
    drop_idx = np.random.choice(last_idx, n_drop, replace=False)
    random_idxs[drop_idx, -1] = -1
    n_for_each_label = np.sum(random_idxs != -1, axis=1)
    random_idxs = random_idxs[random_idxs != -1]

Output:

n_for_each_label = array([5, 5, 2, 3, 5])

The number from each of your error types to sample, or if you want to skip to the end:

random_idxs = array([ 3, 11, 8, 13, 9, 22, 15, 17, 20, 18, 23, 24, 25, 26, 27, 36, 32, 38, 35, 33])
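
To plug this back into the OP's dataframe, a sketch (assuming `y = df.ErrorType.values` and a default integer index, so the entries of `random_idxs` are positional row numbers):

# The stratified sample is then just a positional row selection
stratified_sample = df.iloc[random_idxs]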

piman314
  • You know @ncfirth, the OP is looking for a way to find a threshold, i.e. he's looking for how many samples from each row would lead to the desired sample size: if he has 10,10,3,2,10 then he wants 5,5,3,2,5, so that would be equal to 20. If he has 10,10,4,4,10 then he wants 4,4,4,4,4, so it would be equal to 20. – Bharath M Shetty Nov 30 '17 at 16:42
  • Run your code over the sample data from my answer; nobody understands mere code. You should mention how to plug this into the OP's dataframe. – Bharath M Shetty Nov 30 '17 at 16:51
  • There's sample data in my answer too, I can make it more explicit though. Given OP's question it's a trivial connection `y=df.ErrorType` – piman314 Nov 30 '17 at 17:07
0

No magic numbers. Simply sample from the entire population, coded in an obvious way.

The first step is to replace each 'X' with the numeric code of the stratum in which it appears. Thus coded, the entire population is stored in one string, called entire_population.

>>> strata = {}
>>> with open('skewed.txt') as skewed:
...     _ = next(skewed)
...     for line in skewed:
...         error_type, samples = line.rstrip().split()
...         strata[error_type] = samples
... 
>>> whole = []
>>> for _ in strata:
...     strata[_] = strata[_].replace('X', _)
...     _, strata[_]
...     whole.append(strata[_])
...     
('3', '33')
('2', '22222222')
('1', '111111111111111')
('5', '555555555555')
('4', '444')
>>> entire_population = ''.join(whole)

Given the constraint that the sample_size must be 20, randomly sample from the entire population to form a complete sample.

>>> sample = []
>>> sample_size = 20
>>> from random import choice
>>> for s in range(sample_size):
...     sample.append(choice(entire_population))
...     
>>> sample
['2', '5', '1', '5', '1', '1', '1', '3', '5', '5', '5', '1', '5', '2', '5', '1', '2', '2', '2', '5']

Finally, characterise the sample as a sampling design by counting the representatives of each stratum in it.

>>> from collections import Counter
>>> Counter(sample)
Counter({'5': 8, '1': 6, '2': 5, '3': 1})
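
Note that `random.choice` draws with replacement and gives no per-stratum guarantee; in this particular draw, stratum '4' did not appear at all. A variation without replacement (a sketch) swaps the loop for `random.sample`:

>>> from random import sample
>>> sample(entire_population, sample_size)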
Bill Bell