0

I have a dataset like below:

enter image description here

Which needs to be checked against the below master dataset:

enter image description here

Here, the condition is to check the associated value for the Group and its value from the dataset with the master dataset and pick the S2SFlag column value from the master dataset and create an S2SFlag column in the dataset. The original dataset is returned by querying the database using AWS Lambda.

Also, if post-checking if there is a 'No' value for any of the Group attributes then the Flag for that particular group would be 'No'. The expected output is as below:

enter image description here

As it can be seen that for the first record in the expected output, S2SFlag for each of the group attributes is Yes so the S2SFlag for the first record is 'Yes' and in the second case it is No because one of the values of Age has No as its S2SFlag.

Attempt so far:

df1 = pd.read_csv('/home/siddhesh/Documents/data.csv')

df2 = pd.read_csv('/home/siddhesh/Documents/data1.csv')  # Assuming it as query results

age = df2['Age']
Channel = df2['Channel']
Status = df2['Status']  

df3 = df1[(df1['Attributes'] = str(age))]  

if (age == df3.iloc[0]):
    getS2s = df1['S2SFlag']
    if (getS2s == 'No'):
        df1['S2SFlag'] = 'No'  

It gives me the below error:

Traceback (most recent call last):
  File "/home/siddhesh/Downloads/pyspark/src/sample/test.py", line 16, in <module>
    if (age == df3.iloc[0]):
  File "/home/siddhesh/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 931, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/siddhesh/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1566, in _getitem_axis
    self._validate_integer(key, axis)
  File "/home/siddhesh/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1500, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
whatsinthename
  • 1,828
  • 20
  • 59
  • That's because `df3` is empty. And that's because you have a typo when you assign it, should be `== str(age)`, not `= str(age)`. – fsl Feb 08 '22 at 15:39
  • I corrected it but it gives me the same error maybe because of something else now. Can I use some kind of iteration on df column values and get what I want? I am new to Pandas. So any hints would be appreciated – whatsinthename Feb 08 '22 at 15:51

1 Answers1

1

We can first create mappings for each group:

mapping_Age = df1[df1['Group'] == 'Age'].set_index('Attributes')['S2SFlag'] == 'Yes'

In the above line, we filter to keep only Age rows and use their attributes values as the key of this mapping, but instead of using the S2SFlag as the value of this mapping, I changed them from Yes to True and from No to False as it's more program-friendly to process boolean values. We will change them back to Yes/No later. Note that we need to make sure the same Attributes value does not appear twice in any mapping. Otherwise it will report errors in later step. Similarly

mapping_Channel = df1[df1['Group'] == 'Channel'].set_index('Attributes')['S2SFlag'] == 'Yes'

mapping_Status = df1[df1['Group'] == 'Status'].set_index('Attributes')['S2SFlag'] == 'Yes'

We can then apply the mappings to our data like the following

mapped_Age = df2['Age'].astype(str).map(mapping_Age)
mapped_Channel = df2['Channel'].map(mapping_Channel)
mapped_Status = df2['Status'].map(mapping_Status)

Note that if map cannot find a mapping from (e.g.) mapping_Age for a particular value in df2['Age'], it will return a np.nan instead of a True/False, you need to define how to handle this case by appending .fillna(True) or .fillna(False) at the end of each of them.

Lastly, your final S2SFlag is Yes (or True) only if all are Yess, so

df2['S2SFlag'] = (mapped_Age & mapped_Status & mapped_Channel).map({True: 'Yes', False: 'No'})
Raymond Kwok
  • 2,461
  • 2
  • 9
  • 11
  • Hey, Thank you so much for your detailed answer. But when I tried it gives me output as 'No' only for both the records but according to my expectations it should give 'No' for the second record only since the flag value for `30` is No. What needs to be changed? – whatsinthename Feb 09 '22 at 05:41
  • I added `astype` to `mapped_Age = df2['Age'].astype(str).map(mapping_Age)`. Please try again. – Raymond Kwok Feb 09 '22 at 05:47
  • Yes, since I am getting it as a string from the query results as well, casting is needed. Thank you so much though for your answer. I was thinking of applying UDF and all but you made my day :) – whatsinthename Feb 09 '22 at 05:49
  • My pleasure. :) – Raymond Kwok Feb 09 '22 at 05:51
  • Just a question, If there is a range mentioned in the master table for `Age` like `20-30` and in the dataset, the value for `Age` is `25`...How do I Check this while mapping? – whatsinthename Feb 09 '22 at 06:36
  • then my code won't work. One workaround is to break down the range into individual rows where one row has one `Age` value. – Raymond Kwok Feb 09 '22 at 06:53
  • Okay, so you mean to say something like explode while retaining other column values? – whatsinthename Feb 09 '22 at 06:55
  • Yes, this is what I mean. – Raymond Kwok Feb 09 '22 at 06:55
  • Just FYI. The good thing about exploding it is that it is easier for you to check if an Age shows up multiple times. Obviously if you have a range 20-30 and another range 29-49, then there is an overlap, and you can get rid of that by inspecting `mapping_Age`. I mention about this because exploding ranges is not the only workaround. Another approach would be `pd.merge` plus some conditional checks but that could be way more messy so I would recommend to explode the master table. – Raymond Kwok Feb 09 '22 at 07:01
  • Yeah, I am gonna try to explode the master table only but for now, I just want to check if my age is `not null` or `blank`. So how can I do it considering your code? – whatsinthename Feb 09 '22 at 07:03
  • `df2['Age'].isna()` will return you a series of `True/False` where `True` indicates that the corresponding `Age` is null. – Raymond Kwok Feb 09 '22 at 07:05
  • Do I need to change any mapping for this? What I mean to say is irrespective of the value present in the master table, the mapping should return `Yes` if there is some value for `Age` – whatsinthename Feb 09 '22 at 07:09
  • use this: `mapped_Age = ~df2['Age'].isna()`. Now the mapping is irrelvant, and it only checks if `Age` is null or not. we use `~` to invert the True/False to make null a `False` (or `No`) – Raymond Kwok Feb 09 '22 at 07:11
  • I replaced `mapped_Age = df2['Age'].astype(str).map(mapping_Age)` with `mapped_Age = ~df2['Age'].isna()` but result is the same – whatsinthename Feb 09 '22 at 07:15
  • For the last table in your question, your master table has to have `Lapse` for `Channel` in order to make its `S2SFlag` be `Yes`. However, I do not see that from the master table in your question. Can you check on it? Also, You can actually print out `mapped_Age`, `mapped_Channel`, and `mapped_Status` to see which one is `False` that causes the final `S2SFlag` be `False`. – Raymond Kwok Feb 09 '22 at 07:19
  • Yeah, my bad. It worked...Thank you so much again for answering my naive questions. – whatsinthename Feb 09 '22 at 07:22
  • 1
    Again, my pleasure :) – Raymond Kwok Feb 09 '22 at 07:23
  • I am trying to modify your answer by adding one range type column...can you please have a look over here https://stackoverflow.com/questions/71052625/how-to-map-for-range-values-in-python-pandas – whatsinthename Feb 09 '22 at 16:25
  • Hey, sorry for getting back late. I am in the Hong Kong time zone. Please check that question for my answer – Raymond Kwok Feb 09 '22 at 21:21
  • It worked. Thank you so much again :) – whatsinthename Feb 10 '22 at 05:08