
I have a dataset like below:

Customer_ID  Policy_No  Age  Channel  Status  Income
123          5432       5    DD       Lapse   300000
123          5431       1    Agency   Active  420000
234          4352       7    DD       Lapse   650000

and a master dataset:

Group    Attributes        S2SFlag
Age      1 to 5            Yes
Age      6 to 10           No
Channel  DD                Yes
Channel  Agency            Yes
Status   Lapse             Yes
Status   Active            No
Income   300000 to 500000  Yes
Income   500000 to 700000  No

Here, the condition is: for each row in the dataset, look up each column's value against the matching Group in the master dataset, pick the corresponding S2SFlag value, and create an S2SFlag column in the dataset. The original dataset is returned by querying the database using AWS Lambda.

Also, if after this lookup any of the Group attributes for a row maps to 'No', then the flag for that whole row should be 'No'. The expected output is as below:

Customer_ID  Policy_No  Age  Channel  Status  Income  S2SFlag
123          5432       5    DD       Lapse   300000  Yes
123          5431       1    Agency   Active  420000  No
234          4352       7    DD       Lapse   650000  No

I have tried the below code, but it is giving me the following error:

import pandas as pd
import re

df1 = pd.read_csv('')  # Master table


def to_range(s):
    # Turn an 'a to b' string into an inclusive range; leave other values as-is
    m = re.match(r'(\d+) to (\d+)$', s)
    return range(int(m.group(1)), int(m.group(2)) + 1) if m else s


# Expand each range into one row per value
df1 = (df1.assign(Attributes=[to_range(s) for s in df1['Attributes']])
       .explode('Attributes'))

df2 = pd.read_csv('')  # Assuming these are the query results

# One boolean mapping per Group: attribute value -> (S2SFlag == 'Yes')
mapping_Age = df1[df1['Group'] == 'Age'].set_index('Attributes')['S2SFlag'] == 'Yes'
mapping_Channel = df1[df1['Group'] == 'Channel'].set_index('Attributes')['S2SFlag'] == 'Yes'
mapping_Status = df1[df1['Group'] == 'Status'].set_index('Attributes')['S2SFlag'] == 'Yes'
mapping_Income = df1[df1['Group'] == 'Income'].set_index('Attributes')['S2SFlag'] == 'Yes'

mapped_Age = df2['Age'].map(mapping_Age)
mapped_Channel = df2['Channel'].map(mapping_Channel)
mapped_Status = df2['Status'].map(mapping_Status)
mapped_Income = df2['Income'].map(mapping_Income)

# A row is 'Yes' only if every group maps to 'Yes'
df2['S2SFlag'] = (mapped_Age & mapped_Status & mapped_Channel & mapped_Income).map({True: 'Yes', False: 'No'})

Traceback (most recent call last):
  File "", line 27, in <module>
    mapped_Income = df2['Income'].map(mapping_Income)
  File "", line 4161, in map
    new_values = super()._map_values(arg, na_action=na_action)
  File "", line 842, in _map_values
    indexer = mapper.index.get_indexer(values)
  File "", line 3442, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Also, I tried something like the below, which gave me an incorrect result:

import pandas as pd
import re

df1 = pd.read_csv('') #Master table
df2 = pd.read_csv('') #Dataset
df1 = df1[df1['Group'] == 'Income']

for i, j in df2['Income'].iteritems():
    for k, l in zip(df1['Attributes'], df1['S2SFlag']):
        start = k.split(" to ")[0]
        end = k.split(" to ")[1]
        if j in range(int(start), int(end)):
            print("J ", j, " l ", l)
            df2['S2SFlag'] = l

Output:

  Customer_ID  Policy_No  Age Channel  Status  Income S2SFlag
0          123       5432    5      DD   Lapse  300000      No
1          123       5431    1  Agency  Active  420000      No
2          234       4352    7      DD   Lapse  650000      No

Where am I going wrong?

Update 1:

As per Raymond's answer, everything worked like a charm except for one scenario where I am not getting the expected output.

Master Data:

Group              Attributes                S2SFlag
Insured Age Group  1 to 20                   Yes
Insured Age Group  21 to 25                  Yes
Insured Age Group  26 to 30                  Yes
Insured Age Group  31 to 40                  Yes
Insured Age Group  41 to 50                  Yes
Insured Age Group  51 to 100                 Yes
Policy Status      Lapse                     Yes
Policy Status      Active                    Yes
Policy Status      RPU                       No
Policy Status      Paid Up/ PPT Over         Yes
Policy Status      Claim settled             Yes
Policy Status      Claim rejected            No
Income Group       0 to 300000               Yes
Income Group       300001 to 500000          Yes
Income Group       500001 to 700000          Yes
Income Group       700001 to 1000000         Yes
Income Group       1000001 to 1500000        Yes
Income Group       1500001 to 1600000        Yes
Marital status     Single                    Yes
Marital status     Married                   Yes
Marital status     Divorced                  Yes
Marital status     Married with a child      Yes
Marital status     Single Parent             Yes
Marital status     2 children                Yes
Occupation         Salaried                  Yes
Occupation         Daily wage                No
Occupation         Self employed with <5LPA  No
Occupation         Self employed with >5LPA  Yes
Occupation         House wife                Yes
Occupation         House Husband             Yes
Education          1 to 10                   Yes
Education          11 to 12                  Yes
Education          Diploma                   Yes
Education          Graduate                  Yes
Education          Post Graduate             Yes
Education          Doctoral                  Yes
Fund Value         Positive                  Yes
Fund Value         Negative                  No
Fund Value         Zero                      No

When I tried printing the mapped values for Education, it gave me results like below:

mapped_Education = df2['Education'].map(mapping_Education)
print(mapped_Education)  

0     NaN
1     NaN
2    True
Name: Education, dtype: object  

Dataset:

Customer_ID  Policy_No  Age  Status  Income  Marital_Status  Occupation  Education  Fund Value
123          5432       5    Lapse   300000  Single          Salaried    6          Positive
123          5431       1    Active  420000  Married         Daily wage  11         Zero
234          4352       7    Lapse   650000  Divorced        Daily wage  Doctoral   Zero

Due to this, I am getting output as below:

 Customer_ID  Policy_No  Age  Status  ...  Occupation Education Fund Value S2SFlag
0          123       5432    5   Lapse  ...    Salaried         6   Positive      No
1          123       5431    1  Active  ...  Daily wage        11       Zero      No
2          234       4352    7   Lapse  ...  Daily wage  Doctoral       Zero      No

Here, the S2SFlag for the first record should have been 'Yes'.

whatsinthename
  • Can you please post the data for the master dataset, rather than an image? – Jonathan Leon Feb 09 '22 at 16:52
  • Yeah, I have modified my question with the required dataset @JonathanLeon – whatsinthename Feb 09 '22 at 17:02
  • I came to a conclusion where I figured out that the second approach will never work since my use case is different but not able to figure out what's wrong with the first approach. Can I use the `between()` method for some kind of comparison and then map? – whatsinthename Feb 09 '22 at 18:10

1 Answer


The error `Reindexing only valid with uniquely valued Index objects` says that `map` requires your `mapping_Income` to have a unique index. That index originates from `df1[df1['Group'] == 'Income'].set_index('Attributes')`, where the 'Attributes' for Income contain exploded values that are not unique.

Your code for exploding is inclusive at both ends of the range, so when you explode the two Income ranges, 500000 appears once in each, ending up with two 500000 entries in the index and thus the error. Modify either one of the ranges to exclude its 500000, e.g. make the second range 500001 to 700000.
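To see the problem in isolation, here is a minimal sketch with toy frames using the question's column names (the values are illustrative; the commented-out line is the one that would raise):

import pandas as pd

# Two inclusive ranges that share the boundary value 500000
master = pd.DataFrame({
    'Group': ['Income', 'Income'],
    'Attributes': [range(300000, 500001), range(500000, 700001)],
    'S2SFlag': ['Yes', 'No'],
}).explode('Attributes')

mapping = master.set_index('Attributes')['S2SFlag'] == 'Yes'
print(mapping.index.is_unique)  # False: 500000 appears twice
# pd.Series([500000]).map(mapping)  # would raise InvalidIndexError

# After excluding 500000 from the second range, the index is unique
master = pd.DataFrame({
    'Group': ['Income', 'Income'],
    'Attributes': [range(300000, 500001), range(500001, 700001)],
    'S2SFlag': ['Yes', 'No'],
}).explode('Attributes')
mapping = master.set_index('Attributes')['S2SFlag'] == 'Yes'
print(mapping.index.is_unique)  # True
print(pd.Series([300000, 650000]).map(mapping))  # True, False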

For Update 1:

I suspect it has to do with data types again. The exploded values could be integers, while all values of df2['Education'] are strings.

One quick fix is to change everything into strings.

#1 add `df1['Attributes'] = df1['Attributes'].astype(str)` after the line for `explode`

#2 add `df2 = df2.astype(str)` before any `map` to make all contents strings
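For illustration, a minimal sketch of both fixes on toy data (the Education rows here are invented, and this toy master omits the question's '11 to 12' row, so 11 still maps to NaN):

import pandas as pd

# Toy stand-ins for the master table and the query result
df1 = pd.DataFrame({
    'Group': ['Education', 'Education'],
    'Attributes': [range(1, 11), 'Doctoral'],
    'S2SFlag': ['Yes', 'Yes'],
}).explode('Attributes')

# Fix #1: make the exploded attributes strings
df1['Attributes'] = df1['Attributes'].astype(str)

df2 = pd.DataFrame({'Education': [6, 11, 'Doctoral']})

# Fix #2: make the dataset all strings before any map
df2 = df2.astype(str)

mapping_Education = df1[df1['Group'] == 'Education'].set_index('Attributes')['S2SFlag'] == 'Yes'
print(df2['Education'].map(mapping_Education))
# 0    True  ('6' now matches the stringified 1-to-10 range)
# 1     NaN  ('11' is absent from this toy master)
# 2    True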

Raymond Kwok
  • It worked like a charm :) Thank you so much again for helping me out :) – whatsinthename Feb 10 '22 at 05:09
  • No problem. :). – Raymond Kwok Feb 10 '22 at 05:18
  • Just one question: since the income range is large and I am exploding it, is it a good way to do it? I observed in the IDE that the results are still fast, but just wanted to check since I am not aware of the internals of the Pandas API. – whatsinthename Feb 10 '22 at 05:20
  • Interesting. If we forget about the explode step, since it only runs once, the most time-consuming part should be `df2['Age'].map(mapping_Age)`. Our `mapping_Age` is a pd.Series and I am not sure if its index is **hashed** or not. If not, it may take more time to search for a mapping index. An alternative would be to do `df2['Age'].map(mapping_Age.to_dict())`, which will first convert the `pd.Series` into a `dict`, which **will hash** the index. Changing from **not-hashed** to **hashed** should speed up the mapping process. You may time both and compare (see the timing sketch after this thread). Share the result with me if you time them. :) – Raymond Kwok Feb 10 '22 at 05:30
  • This https://stackoverflow.com/questions/27238066/what-is-the-point-of-indexing-in-pandas says pandas index is backed by a hash table. In this case using `dict` may not give us improvement. – Raymond Kwok Feb 10 '22 at 05:41
  • I haven't checked the timing yet but, according to you the way which we are doing right now should be faster compared to converting it to `dict()`? – whatsinthename Feb 10 '22 at 05:50
  • Sure, will share the results. Let me merge the code in the actual environment, make it run, and then will test it. – whatsinthename Feb 10 '22 at 05:51
  • `dict` has hash-table. if `pd.Series` also has a hash-table as that answer suggested, then their performance should be similar. Please don't worry about the timing check. Please only share if you are interested and did that :) (In case you do that, use `to_dict` for both Age and Income at the same time and time it, then take away `to_dict` from both at the same time and time it) – Raymond Kwok Feb 10 '22 at 05:56
  • Sure. I am actually interested because this will be used in API and timing would matter in that case. – whatsinthename Feb 10 '22 at 06:03
  • You were correct, `dict()` is a little bit slower compared to when it is not used. Time without `dict()` = `1644474643.4004178` and with `dict()` = `1644474687.1044276`. I think it can be ignored since the difference is small, but there is still a performance difference. – whatsinthename Feb 10 '22 at 06:35
  • is the performance acceptable? (e.g. did you make sure the explode only run once throughout the life of your API handler?) – Raymond Kwok Feb 10 '22 at 06:42
  • It will be directly proportional to the number of times the API is hit. – whatsinthename Feb 10 '22 at 06:44
  • I see. perhaps you have different master tables for different API call parameters. if after some time you find that the performance is not good enough, we can think about how to do it without the explode thing at all. – Raymond Kwok Feb 10 '22 at 06:46
  • Yeah, correct. Also, I am facing one issue even though the data is correct this time. Could you please check the modified question after a minute, so that I will edit the question by that time? – whatsinthename Feb 10 '22 at 06:49
  • I have added `Update 1` in the post. Could you please help me with that too? – whatsinthename Feb 10 '22 at 06:58
  • Why it is giving me `NaN` for education even though I can see that the mapping is happening fine. – whatsinthename Feb 10 '22 at 07:09
  • I suspect it has to do with data-type again. The exploded value could be integers, but all values of df2['Education'] are string. One quick fix is to change everything into strings. #1 add `df1['Attributes'] = df1['Attributes'].astype(str)` after the line for `explode` #2, add `df2 = df2.astype(str)` before any `map` to make all contents strings. – Raymond Kwok Feb 10 '22 at 07:20
  • obviously making everything string is an over-kill for Age/Income because it always only contain numbers. if you prefer not to over-kill, I can suggest code to do so. – Raymond Kwok Feb 10 '22 at 07:23
  • Your suggestion worked. It was a data type issue: the `Education` column contained both `numbers` and `strings`, while the `master` table had only `int` data; since `age` and `income` contain no strings, there was no ambiguity with those columns. So, converting everything to `string` worked. I didn't get your last comment. Is there an easy fix for this? The last part in the API would be to extract this `S2SFlag` and just pass it to the JSON response. – whatsinthename Feb 10 '22 at 07:28
  • I said it's an over-kill because we actually don't need to convert `Age` and `Income` into string because `Education` is the guy suffering the data-type problem. In case you need to do math operation on `Age` or `Income` then it is obviously not ideal to convert them into string. However, from your last comment, it seems making everything into strings won't do any harm, and in that case we can ignore that comment :) – Raymond Kwok Feb 10 '22 at 07:31
  • Yeah, for now, this will work. Do you mind if I send you a request over LinkedIn? – whatsinthename Feb 10 '22 at 07:37
  • Please send. I am waiting for that. – Raymond Kwok Feb 10 '22 at 07:38
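Following up on the timing discussion in the comments, a minimal sketch of how the two mapping variants could be compared with `timeit` (toy data; `mapping_Age` is named after the question's variable, and any speedup will depend on data size and pandas version):

import timeit

import pandas as pd

# Toy mapping: age value -> boolean flag, mirroring mapping_Age from the question
mapping_Age = pd.Series(True, index=range(1, 101))
df2 = pd.DataFrame({'Age': [5, 1, 7] * 100_000})

series_time = timeit.timeit(lambda: df2['Age'].map(mapping_Age), number=10)
dict_time = timeit.timeit(lambda: df2['Age'].map(mapping_Age.to_dict()), number=10)

print(f"map via pd.Series: {series_time:.3f}s")
print(f"map via dict:      {dict_time:.3f}s")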