3

I have a dataframe that looks as follows:

    data = np.array([[5, 'red', 2,6, 8, 10],
                 [11, 'red', 3,9,6,15],
                 [8, 'blue', 0, 3, 5, 10],
                 [2, 'blue', 1, 2, 3, 4]])
    df = pd.DataFrame(data, columns = ['A','B','red_lower', 'red_upper', 'blue_lower', 'blue_upper'])
    A     B red_lower red_upper blue_lower blue_upper
0   5   red         2         6          8         10
1  11   red         3         9          6         15
2   8  blue         0         3          5         10
3   2  blue         1         2          3          4

I'd like to create an additional column that tells me if the value in a column A is in the range of the color given in column B. For example, in row 0, since 5 has the designation red, I will check if 5 is between 2 and 6. It is, so I will have the new column have a 1.

Desired result:

    A    B   red_lower red_upper blue_lower blue_upper in_range
0   5   red         2         6          8         10        1
1  11   red         3         9          6         15        0
2   8  blue         0         3          5         10        1
3   2  blue         1         2          3          4        0

I've tried to write a loop, but I'm getting many series errors. I really dont want to have to split up the dataframe (by color), but maybe that's the way to go? (in my actual dataframe, there are six different 'colors', not just two).

Thank you!

EDIT: bonus if we have the additional column tell me if the value is above or below the range! For example, in row 1, 11 is outside the range, so is too high. Table should look this way:

    A     B red_lower red_upper blue_lower blue_upper in_range
0   5   red         2         6          8         10   inside
1  11   red         3         9          6         15    above
2   8  blue         0         3          5         10   inside
3   2  blue         1         2          3          4    below

2 Answers2

3

justify + broadcast + mask + logical_and

You can use some nifty broadcasting here, and the function justify from another answer. This assumes that each color has a single valid range. It also assumes that all of your numeric columns are in fact numeric.


values = df.A.values
colors = df.B.values

range_frame = df.iloc[:, 2:]
ranges = range_frame.columns.str.split('_').str[0].values

m = colors != ranges[:, None]
masked = range_frame.mask(m)

jf = justify(masked.values, invalid_val=np.nan)[:, :2]
ir = np.logical_and(jf[:, 0] < values, values < jf[:, 1]).astype(int)

c1 = values <= jf[:, 0]
c2 = values >= jf[:, 1]

irl = np.select([c1, c2], ['below', 'above'], 'inside')

df.assign(in_range=ir, in_range_flag=irl)

    A     B  red_lower  red_upper  blue_lower  blue_upper  in_range in_range_flag
0   5   red          2          6           8          10         1        inside
1  11   red          3          9           6          15         0         above
2   8  blue          0          3           5          10         1        inside
3   3  blue          1          2           3           4         0         below

stack + reshape + logical_and

Again making the same assumptions as the first answer.


u = df.set_index(['A', 'B']).stack().rename_axis(['A', 'B', 'flag']).reset_index()
frame = u[u.flag.str.split('_').str[0] == u.B]

values = frame[::2].A.values
ranges = frame[0].values.reshape(-1, 2)

ir = np.logical_and(ranges[:, 0] < values, values < ranges[:, 1])

c1 = values <= ranges[:, 0]
c2 = values >= ranges[:, 1]

irl = np.select([c1, c2], ['below', 'above'], 'inside')

df.assign(in_range=ir, in_range_flag=irl)

Here is the definition for the justify function by @Divakar:

def justify(a, invalid_val=0, axis=1, side='left'):    
    """
    Justifies a 2D array

    Parameters
    ----------
    A : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    out = np.full(a.shape, invalid_val) 
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
user3483203
  • 50,081
  • 9
  • 65
  • 94
  • I definitely should have mentioned that I have some empty strings (not NULLS) in certain ranges where they aren't provided (it's not the cleanest data in the world). Each color does match up to a 'range', but that range may be empty. I could amend this code with an if statement that checks if the range is empty first to put the string 'not provided' in the additional column? But I'm not sure how that would affect the justify function. – shakebeforeopening Jul 08 '19 at 22:02
3

Here is using groupby split the df and most of step handled by the definition , which means you do not need input the different color each time

l=[]
for name,x  in df.groupby('B',sort=False):
    s1=(x.A >= x.filter(like=name).iloc[:, 0]) & (x.A <= x.filter(like=name).iloc[:, 1])
    s2=x.A<x.filter(like=name).iloc[:, 0]
    l.extend(np.select([s1,s2],['inside','below'],default='above').tolist())

df['in_range']=l
df
Out[64]: 
    A     B  red_lower  red_upper  blue_lower  blue_upper in_range
0   5   red          2          6           8          10   inside
1  11   red          3          9           6          15    above
2   8  blue          0          3           5          10   inside
3   2  blue          1          2           3           4    below
BENY
  • 317,841
  • 20
  • 164
  • 234
  • I think the issue here is my 'color' strings might not show up *exactly* in the column name - there are extra dashes between words, and there are capitalization issues, so the filter isn't perfect. When running this, I get a 'single positional indexer is out-of-bounds' error. I'll try to play with it to edit my column names and B entries to make them match up nicer perhaps? – shakebeforeopening Jul 08 '19 at 21:58
  • 1
    This needed lots of tweaking for the mess of data I'm working with, but this was really helpful, thank you! – shakebeforeopening Jul 09 '19 at 17:42
  • I want to edit this because there is a problem in the code here. This base code DOES NOT WORK if the colors are mixed. If this table is changed to red/blue/red/blue, the designations are off. I'm going to uncheck this as a solution for now - any thoughts @WeNYoBen ? – shakebeforeopening Jul 10 '19 at 20:31
  • @shakebeforeopening then you should sort your columns first , then after assign it back , you can always sort_index get back the original ordered ..My code work , just need to have more effort on it, Like`df=df.sort_values('B')`, then using above code , then `df=df.sort_index()` – BENY Jul 10 '19 at 22:17