3

I have a "lookup" table formatted as such:

Min | Max | Val
  1 |  99 | "Principal"
100 | 199 | "Partner"
... | ... | ...

There is a CURRENT_POINTS series in my dataframe that is between Min or Max (inclusive).

Question: how do I create a VAL column that is based on the above lookup table? My initial thought was to use df.lookup, but there are 800K rows in df and so the two tables are not equally sized.

Thanks in advance for your help!

Any thoughts?

user791411
  • 149
  • 12

1 Answers1

3

I would use cut() method.

Assuming you have the following DFs:

In [187]: lkp
Out[187]:
   Min  Max  Val
0    1   99  AAA
1  100  199  BBB
2  200  299  CCC
3  300  399  DDD

In [188]: df
Out[188]:
   CURRENT_POINTS
0              55
1              10
2              20
3             144
4             194
5             143
6             397
7             233
8             128
9             215

Using cut() method we can produce a new column of a category dtype, which might save a lot of memory:

In [189]: df['Val'] = pd.cut(df.CURRENT_POINTS,
     ...:                    bins=[0] + lkp[['Min','Max']].stack()[1::2].tolist(),
     ...:                    labels=lkp.Val.tolist())
     ...:

In [190]: df
Out[190]:
   CURRENT_POINTS  Val
0              55  AAA
1              10  AAA
2              20  AAA
3             144  BBB
4             194  BBB
5             143  BBB
6             397  DDD
7             233  CCC
8             128  BBB
9             215  CCC

In [191]: df.dtypes
Out[191]:
CURRENT_POINTS       int32
Val               category
dtype: object

Category dtype can save a lot of memory:

In [192]: big = pd.concat([df] * 10**5, ignore_index=True)

In [193]: big.shape
Out[193]: (1000000, 2)

In [194]: big['str_col'] = 'AAA'

In [198]: big.dtypes
Out[198]:
CURRENT_POINTS       int32
Val               category
str_col             object
dtype: object

In [195]: big.memory_usage()
Out[195]:
Index                  80
CURRENT_POINTS    4000000
Val               1000032     # <--- `category` column takes 1 byte per row (plus 32 bytes overhead)
str_col           8000000

In [197]: big.head()
Out[197]:
   CURRENT_POINTS  Val str_col
0              55  AAA     AAA
1              10  AAA     AAA
2              20  AAA     AAA
3             144  BBB     AAA
4             194  BBB     AAA

NOTE: pay attention at memory usage for the category column Val and for the str_col column (dtype: object)

Explanation:

bins:

In [199]: lkp[['Min','Max']]
Out[199]:
   Min  Max
0    1   99
1  100  199
2  200  299
3  300  399

In [200]: lkp[['Min','Max']].stack()
Out[200]:
0  Min      1
   Max     99
1  Min    100
   Max    199
2  Min    200
   Max    299
3  Min    300
   Max    399
dtype: int64

In [201]: lkp[['Min','Max']].stack()[1::2].tolist()
Out[201]: [99, 199, 299, 399]

In [202]: [0] + lkp[['Min','Max']].stack()[1::2].tolist()
Out[202]: [0, 99, 199, 299, 399]

labels:

In [203]: lkp.Val.tolist()
Out[203]: ['AAA', 'BBB', 'CCC', 'DDD']

NOTE: lkp must be sorted by ['Min', 'Max'] before using it for bins and labels.

Here is a small demo for sorting:

In [2]: lkp
Out[2]:
   Min  Max  Val
0  300  399  DDD
1  100  199  BBB
2    1   99  AAA
3  200  299  CCC

In [4]: lkp = lkp.sort_values(['Min','Max'])

In [5]: lkp
Out[5]:
   Min  Max  Val
2    1   99  AAA
1  100  199  BBB
3  200  299  CCC
0  300  399  DDD
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • Thanks so much for the response! Would I then just `as.type(str)` to the `Val` column created above to reduce memory load? – user791411 Nov 14 '16 at 15:19
  • 1
    @user791411, you are welcome! No, you want to keep it as `category` - it's just a "tuned" string dtype. I've added `str_col` in order to show the difference in memory usage – MaxU - stand with Ukraine Nov 14 '16 at 15:21
  • OK thanks! **Question**: I received an error message saying `"ValueError: bins must increase monotonically."` How should I change the code? – user791411 Nov 14 '16 at 15:34
  • @user791411, as i wrote at the end of my answer - the lookup DF must be sorted... – MaxU - stand with Ukraine Nov 14 '16 at 15:40
  • Works great! Thanks so much! – user791411 Nov 14 '16 at 18:51
  • I get "ufunc 'subtract' did not contain a loop with signature matching types dtype(' – Man Feb 05 '19 at 21:10
  • 1
    @Mandi, i would recommend you to open a new question and post there a small reproducible sample data set, which would help us to reproduce your error... You can also check yourself that the posted solution is working properly - just create datasets from the answer using `pd.read_clipboard()` function and try to run the same code. – MaxU - stand with Ukraine Feb 05 '19 at 21:33