0

I'm trying to determine if a number in 1 column (COL A) is within a range of 2 numbers found in a nested list in another column (COL B). COL B can have several nested lists of numbers.

example dataframe

This is being done in Python and pandas dataframes. I'm confident I can achieve this by using for loops but I'd like to utilize a vectorized solution. My attempts at using np.where include some variation of this (I've only gone so far as to address situations where column B only has 1 embedded list of numbers. I'm guessing I could use an embedded np.where statement in the False parameter...):

test_df['inRange'] = np.where(np.isin(test_df['COL A'],list(range(test_df['COL B'][0][0],test_df['COL B'][0][1]))), 'match', 'no match')

However, I keep getting an Index Error: list index out of range I speculate I'm not using the correct syntax to refer to embedded lists from another column when using np.where.

Any insight into how to achieve what I'm attempting is appreciated. Thank you!

James
  • 23
  • 5
  • 2
    Please provide a reproducible example, not an image, and the matching expected output. By the look of it, it's likely not possible to vectorize as you have lists. – mozway Jan 05 '23 at 17:37
  • @mozway - I was trying to post this question with an actual table but it wouldn't let me. I was prompted with an error stating that the table was being recognized as code? putting the table in quotes wasn't helping so I opted to use an image. Additionally, the output I'm looking for would be a third column that would have 'match' for all the rows present in the above table. I suppose I should have included another row that presented a 'no match' instance. Apologies for the ambiguity, thanks for asking for clarification and providing me insight! – James Jan 09 '23 at 21:15

1 Answers1

1

You can definitely do it without loops, but I don't think it'll be much better in terms of performance when compared to a looping approach (not a whole lot of truly vectorized operations; This solution is still just looping over the rows of the dataframe and the intervals in the 2d lists, pandas just does it under the hood).

Here the trick is to use a pd.IntervalIndex, but other things are possible too.

import numpy as np
import pandas as pd

def check_in_intervals(row):
    val, intervals = row
    idx = pd.IntervalIndex.from_tuples(
        list(map(tuple, intervals)),
        closed="left"
    )
    return idx.contains(val).any()

df = pd.DataFrame({
    "COL A": [19, 76, 98, 103],
    "COL B": [[[19, 38]], [[76, 84]], [[98, 129]], [[11, 23], [54, 333]]]
})

df["inRange"] = np.where(df.apply(check_in_intervals, axis=1), 'match', 'no match')

df:

   COL A                  COL B inRange
0     19             [[19, 38]]   match
1     76             [[76, 84]]   match
2     98            [[98, 129]]   match
3    103  [[11, 23], [54, 333]]   match
Chrysophylaxs
  • 5,818
  • 3
  • 10
  • 21
  • Thanks @Chrystophylaxs. Can you expand on your statement 'not a whole lot of truly vectorized operations'? Do you mean, 100% vectorized approaches are limited in where they are applicable? – James Jan 09 '23 at 21:20
  • 1
    Here is some more info on vectorization: https://stackoverflow.com/questions/1422149/what-is-vectorization In short, using .apply with a function written in python is not something that is precompiled and vectorized for super speedy code. Additionally, since you're working with python lists, I believe a fair bit of time (relatively) is being taken to convert the data to an IntervalIndex, something I'm mainly doing for clarity. You could also loop over the lists manually and do the checks, which might be quicker, but it's less clear. – Chrysophylaxs Jan 09 '23 at 22:58