1

I have a dataframe with 50K rows. I would like to replace 20% of the data with random values drawn from a given interval. The purpose is to generate synthetic outliers for testing algorithms. The following dataframe is a small part of the df that I have. The values to be replaced with random outliers are in the 'value' column.

import pandas as pd
dict = {'date':["2016-11-10", "2016-11-10", "2016-11-11", "2016-11-11","2016-11-11","2016-11-11","2016-11-11", "2016-11-11" ], 
        'time': ["22:00:00", "23:00:00", "00:00:00", "01:00:00", "02:00:00", "03:00:00", "04:00:00", "04:00:00"], 
        'value':[90, 91, 80, 87, 84,94, 91, 94]} 

df = pd.DataFrame(dict) 

print(df)
        date      time  value
0  2016-11-10  22:00:00     90
1  2016-11-10  23:00:00     91
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00     87
4  2016-11-11  02:00:00     84
5  2016-11-11  03:00:00     94
6  2016-11-11  04:00:00     91
7  2016-11-11  05:00:00     94

For example, I want to draw random values from the interval 1 to 50, and the desired df would look like the following:

        date      time  value
0  2016-11-10  22:00:00     90
1  2016-11-10  23:00:00     91
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00      4
4  2016-11-11  02:00:00     84
5  2016-11-11  03:00:00     94
6  2016-11-11  04:00:00     32
7  2016-11-11  05:00:00     94

I would appreciate any ideas. Thanks!

Sascha
  • If I understand you correctly, do you want to replace 20% of values in `'value'` column with random values from 1 to 50? – Andrej Kesely Dec 29 '19 at 20:03
  • Yes, or with several intervals if possible, for example from 1 to 50 and from 200 to 300. Is it possible to do it automatically? – Sascha Dec 29 '19 at 20:08
  • Using "dict" as a variable name is not suggested as it is a builtin name representing a dictionary – Marios Keri Dec 29 '19 at 20:21

5 Answers

1

Here is a NumPy example that should be fast. The example that includes both the higher and lower replacement assumes that you want to split the replacements evenly (50-50) between the high and low ranges. If that is not the case, you can change the p argument in the mask_high = np.random.choice(...) line to whatever you want.

import numpy as np
import pandas as pd

d = {'date':["2016-11-10", "2016-11-10", "2016-11-11", "2016-11-11","2016-11-11","2016-11-11","2016-11-11", "2016-11-11" ], 
        'time': ["22:00:00", "23:00:00", "00:00:00", "01:00:00", "02:00:00", "03:00:00", "04:00:00", "04:00:00"], 
        'value':[90, 91, 80, 87, 84,94, 91, 94]} 

df = pd.DataFrame(d) 

# create a function
def myFunc(df, replace_pct, start_range, stop_range, replace_col):
    # copy the col you want to replace into a writable array
    val = df[replace_col].to_numpy().copy()
    # create a boolean mask; each row is selected with probability replace_pct,
    # so the replaced share is replace_pct on average, not exactly
    mask = np.random.choice([0,1], p=[1-replace_pct, replace_pct], size=val.shape).astype(bool)
    # create random ints in [start_range, stop_range), one per selected row
    rand = np.random.randint(start_range, stop_range, size=mask.sum())
    # replace values in the copied array
    val[mask] = rand
    # update column
    df[replace_col] = val
    return df

myFunc(df, .2, 1, 50, 'value')

         date      time  value
0  2016-11-10  22:00:00     90
1  2016-11-10  23:00:00     91
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00     87
4  2016-11-11  02:00:00     46
5  2016-11-11  03:00:00     94
6  2016-11-11  04:00:00     91
7  2016-11-11  04:00:00     94

timeit

%%timeit
myFunc(df, .2, 1, 50, 'value')

397 µs ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Example of both high and low replacement

# create a function
def myFunc2(df, replace_pct, start_range_low, stop_range_low,
            start_range_high, stop_range_high, replace_col):
    # copy the col you want to replace into a writable array
    val = df[replace_col].to_numpy().copy()
    # create a boolean mask; each row is selected with probability replace_pct
    mask = np.random.choice([0,1], p=[1-replace_pct, replace_pct], size=val.shape).astype(bool)
    # create random ints in the low range, one per selected row
    rand = np.random.randint(start_range_low, stop_range_low, size=mask.sum())
    # create a mask that moves about half of them to the higher range
    mask_high = np.random.choice([0,1], p=[.5, .5], size=rand.shape).astype(bool)
    # create random ints in the high range
    rand_high = np.random.randint(start_range_high, stop_range_high, size=mask_high.sum())
    # replace values in the rand array
    rand[mask_high] = rand_high
    # replace values in the copied array
    val[mask] = rand
    # update column
    df[replace_col] = val
    return df

myFunc2(df, .2, 1, 50, 200, 300, 'value')


         date      time  value
0  2016-11-10  22:00:00     90
1  2016-11-10  23:00:00    216
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00     49
4  2016-11-11  02:00:00     84
5  2016-11-11  03:00:00     94
6  2016-11-11  04:00:00    270
7  2016-11-11  04:00:00     94

timeit

%%timeit
myFunc2(df, .2, 1, 50, 200, 300, 'value')

493 µs ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
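
Both functions select each row independently with probability replace_pct, so the number of replaced rows is only 20% on average. The following is a minimal sketch of a variant that replaces an exact count and accepts any number of intervals. The helper name inject_outliers, its parameters, and the use of np.random.default_rng are assumptions made for illustration; they are not part of the functions shown in this answer.

```python
import numpy as np
import pandas as pd


def inject_outliers(df, col, frac, intervals, seed=None):
    """Return a copy of df with round(frac * len(df)) values of `col` replaced.

    `intervals` is a list of (low, high) pairs with `high` exclusive,
    matching np.random.randint. Hypothetical helper for illustration.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_replace = int(round(frac * len(out)))
    # pick exactly n_replace distinct row positions
    pos = rng.choice(len(out), size=n_replace, replace=False)
    # pick one interval per replaced row, uniformly among the intervals
    bounds = np.asarray(intervals)
    which = rng.integers(0, len(bounds), size=n_replace)
    # draw one integer from each chosen interval
    new_vals = rng.integers(bounds[which, 0], bounds[which, 1])
    out.iloc[pos, out.columns.get_loc(col)] = new_vals
    return out


df = pd.DataFrame({"value": np.full(50, 90)})
noisy = inject_outliers(df, "value", 0.2, [(1, 50), (200, 300)], seed=0)
print((noisy["value"] != 90).sum())  # 10 rows changed out of 50
```

Returning a copy leaves the input frame untouched, which makes it easy to compare the clean and the contaminated data when evaluating a detector.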
It_is_Chris
  • For whoever wants to generate a pandas dataframe with synthetic data in a controlled manner, take a look at this article: https://towardsdatascience.com/generating-fake-data-with-pandas-very-quickly-b99467d4c618?sk=01b74598e672e0ea8d1afb070a4e085c – Juan Luis Ruiz-tagle Jul 31 '21 at 15:31
0

This might work.

import numpy as np

def get_outlier(x):
    # collect the values whose z-score magnitude exceeds num
    outliers = []
    num = 3
    mean_ = np.mean(x)
    std_ = np.std(x)
    for y in x:
        z_score = (y - mean_) / std_
        if np.abs(z_score) > num:
            outliers.append(y)
    return outliers

detect_outliers = get_outlier(df['value'])

# interquartile-range bounds
q1, q3 = np.percentile(df['value'], [25, 75])
iqr = q3 - q1
lb = q1 - (1.5 * iqr)
ub = q3 + (1.5 * iqr)

# replace every value outside [lb, ub] with a random int in [1, 50)
for i in df.index:
    if (df.loc[i, 'value'] < lb) or (df.loc[i, 'value'] > ub):
        df.loc[i, 'value'] = np.random.randint(1, 50)
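
The row-by-row loop can also be expressed with a boolean mask. This is a sketch only, assuming the usual IQR fences (q1 - 1.5 * iqr and q3 + 1.5 * iqr) and a small made-up series in which the last value is an obvious outlier:

```python
import numpy as np
import pandas as pd

# made-up data: the last value is far from the rest
df = pd.DataFrame({"value": [90, 91, 80, 87, 84, 94, 91, 300]})

q1, q3 = np.percentile(df["value"], [25, 75])
iqr = q3 - q1
lb, ub = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# True for every row that falls outside the fences
outside = (df["value"] < lb) | (df["value"] > ub)
# one random int in [1, 50) per flagged row
df.loc[outside, "value"] = np.random.randint(1, 50, size=outside.sum())
print(df)
```

Here the fences work out to 78.0 and 100.0, so only the last row is flagged and replaced.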
Tunahan A.
0

Another attempt, using DataFrame.sample().

import numpy as np
import pandas as pd

d = {'date':["2016-11-10", "2016-11-10", "2016-11-11", "2016-11-11","2016-11-11","2016-11-11","2016-11-11", "2016-11-11" ],
     'time': ["22:00:00", "23:00:00", "00:00:00", "01:00:00", "02:00:00", "03:00:00", "04:00:00", "04:00:00"],
     'value':[90, 91, 80, 87, 84,94, 91, 94]}

df = pd.DataFrame(d)

random_rows = df.sample(frac=.2)    # 20% random rows from `df`

# we are replacing these 20% random rows with values from 1..50 and 200..300 (in ~1:1 ratio)
random_values = np.random.choice( np.concatenate( [np.random.randint(1, 50, size=len(random_rows) // 2 + 1),
                                                   np.random.randint(200, 300, size=len(random_rows) // 2 + 1)] ),
                size=len(random_rows) )
df.loc[random_rows.index, 'value'] = random_values
print(df)

This prints (for example):

         date      time  value
0  2016-11-10  22:00:00     31   <-- 31
1  2016-11-10  23:00:00     91
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00     87
4  2016-11-11  02:00:00     84
5  2016-11-11  03:00:00    236   <-- 236
6  2016-11-11  04:00:00     91
7  2016-11-11  04:00:00     94
Andrej Kesely
  • Thanks for the detailed answer. I have a question regarding the `size=len(random_rows) // 2 + 1` part. If you specify the number of random values with `size=len(random_rows)`, then what does `// 2 + 1` do? – Sascha Dec 30 '19 at 10:07
  • @Sascha `len(random_rows) // 2 + 1` means `(len(random_rows) // 2) + 1`. `//` is [floor (or integer) division](https://stackoverflow.com/questions/183853/what-is-the-difference-between-and-when-used-for-division) so number of selected rows div 2 plus 1. – Andrej Kesely Dec 30 '19 at 10:18
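
To make the arithmetic concrete: with 50,000 rows, frac=.2 selects 10,000 rows, so each randint call draws 10000 // 2 + 1 = 5001 values, the concatenated pool holds 10,002 candidates, and np.random.choice then draws 10,000 values from that pool (with replacement, which is its default). One effect of the + 1 is that the pool is not empty when only one row is selected, since 1 // 2 is 0. A short sketch of those sizes, using illustrative row counts:

```python
import numpy as np

for n_rows in (1, 2, 10_000):
    half = n_rows // 2 + 1          # floor division, then add one
    pool = np.concatenate([np.random.randint(1, 50, size=half),
                           np.random.randint(200, 300, size=half)])
    picked = np.random.choice(pool, size=n_rows)
    # n_rows, size per range, pool size, number of values actually used
    print(n_rows, half, len(pool), len(picked))
```

Because the final draw samples from the mixed pool, the split between the two ranges is roughly, not exactly, 1:1.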
0

A similar answer using DataFrame.sample():

Example df:

import numpy as np
import pandas as pd 

df = pd.DataFrame({"time_col" : pd.date_range("2018-01-01", "2019-01-01", freq = "H")})
df["date"], df["time"] = df["time_col"].dt.date, df["time_col"].dt.hour 
df["value"] = np.random.randint(100, 150, df.shape[0])

seed = 11 # deterministic behavior, that's what heroes do
rnd_rows_idx = df.sample(frac = 0.2, random_state=seed).index # grabbing indexes

original_rows = df.loc[rnd_rows_idx, "value"] # keeping a trace of original values

### Replacing the values selected at random ### 

df.loc[rnd_rows_idx, "value"] = np.random.randint(1, 50, rnd_rows_idx.shape[0])
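
Because the indices of the replaced rows are known, they can double as ground-truth labels when scoring a detector, and the saved values allow the change to be undone. A sketch on a small stand-in frame follows; the column name is_outlier is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# stand-in data: 1000 values between 100 and 149
df = pd.DataFrame({"value": np.random.randint(100, 150, 1000)})

rnd_rows_idx = df.sample(frac=0.2, random_state=11).index
original_rows = df.loc[rnd_rows_idx, "value"].copy()  # saved before overwriting

df.loc[rnd_rows_idx, "value"] = np.random.randint(1, 50, rnd_rows_idx.shape[0])

# ground-truth flag: True for rows that now hold a synthetic outlier
df["is_outlier"] = df.index.isin(rnd_rows_idx)
print(df["is_outlier"].sum())  # 200 of the 1000 rows

# undo the replacement using the saved values (aligned on the index)
df.loc[rnd_rows_idx, "value"] = original_rows
print(df["value"].between(100, 149).all())  # True: all values restored
```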
Nathan Furnal
-1

Here are a couple of steps you can use. As noted above, you should NOT use dict as a variable name. I did below because I just copied your code inputs.

This code generates a list of indices based on the replacement ratio and the length of the data frame, and then replaces the values at those locations with uniform random integers from 0 to 20, inclusive:

In [49]: # %load 32-36 
    ...: df=pd.DataFrame(dict) 
    ...: import random 
    ...: replacement_ratio = 0.50 
    ...: replacement_count = int(replacement_ratio * len(df)) 
    ...: replacement_idx = random.sample(range(len(df)), replacement_count)            

In [50]: replacement_idx                                                              
Out[50]: [5, 2, 7, 6]

In [51]: for idx in replacement_idx: 
    ...:     df.loc[idx, 'value'] = random.randint(0,20) 
    ...:                                                                              

In [52]: df                                                                           
Out[52]: 
         date      time  value
0  2016-11-10  22:00:00     90
1  2016-11-10  23:00:00     91
2  2016-11-11  00:00:00      4
3  2016-11-11  01:00:00     87
4  2016-11-11  02:00:00     84
5  2016-11-11  03:00:00      4
6  2016-11-11  04:00:00     17
7  2016-11-11  04:00:00      8

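
The loop assigns one row at a time. For a frame with 50K rows, a single vectorized assignment is an alternative. This is a sketch under the assumption that NumPy's Generator (np.random.default_rng) is acceptable in place of the random module; the 20% ratio and the constant stand-in column are illustrative.

```python
import numpy as np
import pandas as pd

# stand-in frame with 50K identical values
df = pd.DataFrame({"value": np.full(50_000, 90)})

rng = np.random.default_rng(0)
replacement_ratio = 0.20
replacement_count = int(replacement_ratio * len(df))
# distinct row labels, the counterpart of random.sample(range(len(df)), k)
replacement_idx = rng.choice(df.index.to_numpy(), size=replacement_count, replace=False)
# integers from 0 to 20 inclusive, like random.randint(0, 20)
df.loc[replacement_idx, "value"] = rng.integers(0, 20, size=replacement_count, endpoint=True)
print((df["value"] != 90).sum())  # 10000 rows replaced
```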
AirSquid