5

In the example below I am trying to check if the "Value" in Table 1 is in the range of values in a row in Table 2 for columns "Start" and "Stop". If the value is in that range I want to return the type of "Fruit". The between method seems to be able to accomplish this but unsure how to apply it to a row from another table. Note I need to perform this task using a large dataset and am open to using methods outside of the pandas library.

enter image description here

Sample Code

df1 = pd.DataFrame({'Date': {0: '06-01', 1: '06-02', 2: '06-03', 3: '06-04'},
                   'Value': {0: 3, 1: 7, 2: 9, 3: 16}, })

df2 = pd.DataFrame({'Start': {0: 1, 1: 6, 2: 11, 3: 16},
                    'Stop': {0: 5, 1: 10, 2: 15, 3: 20},
                    'Fruit': {0: 'Apple', 1: 'Orange', 2: 'Pear', 3: 'Mango'},})

Table 1

    Date  Value
0  06-01      3
1  06-02      7
2  06-03      9
3  06-04     16

Table 2

    Fruit  Start  Stop
0   Apple      1     5
1  Orange      6    10
2    Pear     11    15
3   Mango     16    20

Table 1 Desired Output

    Date  Value  Fruit
0  06-01      3   Apple
1  06-02      7  Orange
2  06-03      9  Orange
3  06-04     16   Mango
eyllanesc
  • 235,170
  • 19
  • 170
  • 241
MBasith
  • 1,407
  • 4
  • 29
  • 48
  • 1
    I recommend IntervalIndex. https://stackoverflow.com/questions/56777623/how-to-aggregate-mean-of-rows-when-values-are-in-between-range-of-2-other-column/56777757#56777757 – cs95 Jun 26 '19 at 23:29
  • Also see my answer here: https://stackoverflow.com/a/46526392/4909087 – cs95 Jun 26 '19 at 23:30
  • Btw, why is fruit `None` at value `9`? Shouldn't that be `Orange`? – Erfan Jun 26 '19 at 23:46
  • You right, 9 should be Orange. – MBasith Jun 26 '19 at 23:52
  • 1
    Then the answer of @cs95 linked works: First: `idx = pd.IntervalIndex.from_arrays(df2['Start'], df2['Stop'], closed='both')` Second: `df['fruit'] = df2.loc[idx.get_indexer(df1['Value']), 'Fruit']` – Erfan Jun 26 '19 at 23:55
  • I am trying to use the code that @cs95 posted but I am getting error `ValueError: cannot reindex from a duplicate axis`. This is what I tried - `idx = pd.IntervalIndex.from_arrays(df2['Start'], df2['Stop'], closed='both') df1['fruit'] = df2.loc[idx.get_indexer(df1['Value']), 'Fruit']`. – MBasith Jun 27 '19 at 00:01

1 Answers1

4

Here is on of the way rather than using IntervalIndex , we check with numpy board-cast

s1=df2.Start.values
s2=df2.Stop.values
s=df1.Value.values[:,None]
np.dot((s>=s1)&(s<=s2),df2.Fruit)
array(['Apple', 'Orange', 'Orange', 'Mango'], dtype=object)
BENY
  • 317,841
  • 20
  • 164
  • 234