0

I have a for loop that is taking a subsample of my original dataset, doing a prediction from a previously fit model, and then i need to match the target value from the original dataframe to the prediction to calculate a different value.

20 lines from original subsample:

index       f0        f1         f2          product
89641   11.758713   -2.548885   5.007187    134.766305
30665   7.134050    -7.369558   3.990141    107.813044
71148   -13.860892  -2.727111   4.995418    137.945408
63263   -1.949113   6.340399    4.999270    134.766305
34301   2.741874    -5.114227   1.990971    57.085625
28150   -9.194978   -8.220917   4.000539    110.992147
37974   5.416532    -6.685454   3.997102    107.813044
63541   8.116958    -0.106199   1.992089    53.906522
69007   -0.886114   -8.732907   3.004329    84.038886
8808    -10.138814  -5.428649   3.996867    110.992147
77082   -7.427920   -9.558472   5.002233    137.945408
30523   0.780631    -1.872719   1.000312    30.132364
78523   3.096930    -6.854314   3.000831    84.038886
66519   4.459357    -6.787551   4.994414    134.766305
69231   10.113738   -10.433003  4.004866    107.813044
48418   -17.092959  -3.294716   1.999222    57.085625
59715   -0.970615   -1.741134   2.012687    57.085625
30159   -7.075355   -16.977595  4.997697    137.945408
34763   5.850225    -5.069475   2.994821    80.859783
99239   -8.493579   -8.126316   1.004643    30.132364

code:

r2_revenue = []
for i in range(1000):
    subsample = r2_test.sample(500,replace=True)
    features = subsample.drop(['product'],axis=1)
    predict = model2.predict(features)
    top_200 = pd.Series(predict).sort_values(ascending=False).iloc[:200]
    target = subsample['product'].isin(top_200)
    result = (revenue(target).sum())
    r2_revenue.append(result)

so, my "target" needs to find the index of each top_200 entry and then find the resulting entry in the ['product'] from the original subsample.

i am striking out on finding the way to take the index number from the series top_200 and find the corresponding product value from the original dataset.

i feel like i am missing something obvious, but searches like "matching an index from a series to a value in a dataframe" are turning up results for a single dataframe, not a series to a dataframe.

if i were looking up data, i'd use a .query() but i don't know how to do that with an index to an index?

any input would be greatly appreciated!

:Edit to help clarify (hopefully):

so my series top_200 is predictions from the subsample dataframe. the index from the series should be the same as the index from the subsample dataframe. based on the index for a particular row, i want to look up a value in the product column of the subsample dataframe with the same index number.

so here is an example output for that series:

303    139.893243
203    138.886222
21     138.561583
296    138.535309
391    138.491757

the rows are 303,203,21,296 and 391. i now want to get the value in the column product from the subsample dataframe for the rows 303,203,21,296 and 391.

seuadr
  • 141
  • 11
  • There is much in your [mre] that seems unrelated to your question you should try to write your examples so that they do not rely on something we cannot use. [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – wwii Mar 24 '21 at 14:48

1 Answers1

1

When you apply a condition to a Series the result is a boolean Series.

>>> s = pd.Series(range(10))
>>> s
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64
>>> q = s % 2 == 0
>>> q
0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
9    False
dtype: bool

You can then use that boolean Series to filter the original.

>>> s[q]
0    0
2    2
4    4
6    6
8    8
dtype: int64

You can obtain the indices of the True values and use that to select from a like indexed Series

>>> q[q].index
Int64Index([0, 2, 4, 6, 8], dtype='int64')
>>> indices = q[q].index
>>> s[indices]
0    0
2    2
4    4
6    6
8    8
dtype: int64

>>> df = pd.DataFrame({'ex':range(10),'wye':list('zyxwvutsrq')}, index=list('abcdefghij'))
>>> df
   ex wye
a   0   z
b   1   y
c   2   x
d   3   w
e   4   v
f   5   u
g   6   t
h   7   s
i   8   r
j   9   q
>>> m = df.ex.isin([2,4,6,8])
>>> m
a    False
b    False
c     True
d    False
e     True
f    False
g     True
h    False
i     True
j    False
Name: ex, dtype: bool

>>> df.loc[m,'wye']
c    x
e    v
g    t
i    r
Name: wye, dtype: object

>>> m[m].index
Index(['c', 'e', 'g', 'i'], dtype='object')
>>> r = m[m].index
>>> df.loc[r,:]
   ex wye
c   2   x
e   4   v
g   6   t
i   8   r

the rows are 303,203,21,296 and 391. i now want to get the value in the column product from the subsample dataframe for the rows 303,203,21,296 and 391

In my example, the rows that meet the condition have the indices Index(['c', 'e', 'g', 'i'], dtype='object') and can be used to select the same rows of the 'wye' column.

>>> df.loc[r,'wye']
c    x
e    v
g    t
i    r
Name: wye, dtype: object

The indices were obtained by filtering the boolean Series for all the True values and accessing the index attribute of the result.

>>> m[m].index
Index(['c', 'e', 'g', 'i'], dtype='object')
>>> r = m[m].index

Pandas User Guide: Selection
Indexing and selecting data

wwii
  • 23,232
  • 7
  • 37
  • 77
  • wwii - thanks for the answer! forgive me, i think i am a little slow picking this up - so for my example with a series (s) and a dataframe (df) i'm still not totally clear how to compare the index of s to the index of df to get my boolean results in the first place? that is - i guess i'm not sure how to pick the index of the series as a condition? – seuadr Mar 24 '21 at 14:12
  • Do you think my made up example illustrates what you are trying to do? I wasn't sure about that. – wwii Mar 24 '21 at 14:17
  • i may be explaining this poorly as well. i'll edit the question to clarify more in depth, but, essentially, i have a series taken from a dataframe - i'm doing some stuff with that series, and then picking some values from it. i then want to match the index value for that series back up with the dataframe. so the index numbers should match between the series and the dataframe, and i want to match the series index 1 with the dataframe index 1 and get a value from the column `products` based on that index number. – seuadr Mar 24 '21 at 14:20
  • That is what my example shows - how to use a boolean series to filter/select/index rows in a Series or DataFrame and alternatively get the indices of the True values of the boolean Series and use those to select/filter/index a Series or DataFrame. – wwii Mar 24 '21 at 14:29
  • thanks for the additional explanation - sorry for the poorly structured question, brand new to all of this. :) Have a good day! – seuadr Mar 24 '21 at 15:18