Pandas GroupBy ID and filtering

Question

I'm kind of new to coding in Python and I need your help.

My original dataframe is:

import pandas as pd
df=pd.DataFrame({'ProductArn': [ 'arn:aws:securityhub:eu-central-1::product/aws/securityhub', 'arn:aws:securityhub:eu-central-1::product/aws/securityhub', 'arn:aws:securityhub:eu-central-1::product/aws/securityhub', 'arn:aws:securityhub:eu-central-1::product/aws/securityhub', 'arn:aws:securityhub:eu-central-1::product/aws/securityhub', 'arn:aws:securityhub:eu-central-1::product/aws/securityhub', 'arn:aws:securityhub:eu-central-1::product/aws/securityhub', 'arn:aws:securityhub:eu-central-1::product/aws/securityhub'], 
                 'GeneratorId': [ 'aws-foundational-security-best practices/v/1.0.0/SecretsManager.4', 'aws-foundational-security-best-practices/v/1.0.0/EC2.6', 'aws-foundational-security-best-practices/v/1.0.0/S3.4', 'aws-foundational-security-best-practices/v/1.0.0/S3.5', 'aws-foundational-guardduty-practices/v/1.0.0/SecretsManager.4', 'aws-foundational-splitfunction-practices/v/1.0.0/SecretsManager.4', 'aws-foundational-security-best-practices/v/1.0.0/S3.5', 'aws-foundational-security-best-practices/v/1.0.0/S3.5'], 
                 'AwsAccountId': [ 961225000000.0, 961225000000.0, 961225000000.0, 961225000000.0, 961225000000.0, 961225000000.0, 971225000000.0, 971225000000.0],
                 'Types': ['Software and Configuration Checks/Industry and Regulatory Standards/AWS-Foundational-Security-Best-Practices', 'Software and Configuration Checks/Industry and Regulatory Standards/AWS-Foundational-Security-Best-Practices', 'Software and Configuration Checks/Industry and Regulatory Standards/AWS-Foundational-Security-Best-Practices', 'Software and Configuration Checks/Industry and Regulatory Standards/AWS-Foundational-Security-Best-Practices','Software and Configuration Checks/Industry and Regulatory Standards/AWS-Foundational-Security-Best-Practices', 'Software and Configuration Checks/Industry and Regulatory Standards/AWS-Foundational-Security-Best-Practices?', 'Software and Configuration Checks/Industry and Regulatory Standards/AWS-Foundational-Security-Best-Practices?', 'Software and Configuration Checks/Industry and Regulatory Standards/AWS-Foundational-Security-Best-Practices?'],
                 'Severity': [ '{Product: 40, Normalized: 40}', '{Product: 40, Normalized: 40}', '{Product: 40, Normalized: 40}', '{Product: 40, Normalized: 40}', '{Product: 40, Normalized: 40}','{Product: 40, Normalized: 40}', '{Product: 40, Normalized: 40}', '{Product: 40, Normalized: 40}'],
                 'Title': ['SecretsManager.4 Secrets Manager secrets should be rotated within a specified number of days', 'SecretsManager.4 Secrets Manager secrets should be rotated within a specified number of days', 'SecretsManager.4 Secrets Manager secrets should be rotated within a specified number of days', 'SecretsManager.4 Secrets Manager secrets should be rotated within a specified number of days', 'SecretsManager.4 Secrets Manager secrets should be rotated within a specified number of days', 'SecretsManager.4 Secrets Manager secrets should be rotated within a specified number of days', 'SecretsManager.4 Secrets Manager secrets should be rotated within a specified number of days', 'SecretsManager.4 Secrets Manager secrets should be rotated within a specified number of days', 'SecretsManager.4 Secrets Manager secrets should be rotated within a specified number of days', 'SecretsManager.4 Secrets Manager secrets should be rotated within a specified number of days'],
                 'ProductFields':['{StandardsArn: arn:aws:securityhub:::standards/aws-foundational-security-best-practices/v/1.0.0}', '{StandardsArn: arn:aws:securityhub:::standards/aws-foundational-security-best-practices/v/1.0.0}', '{StandardsArn: arn:aws:securityhub:::standards/aws-foundational-security-best-practices/v/1.0.0}', '{StandardsArn: arn:aws:securityhub:::standards/aws-foundational-security-best-practices/v/1.0.0}', '{StandardsArn: arn:aws:securityhub:::standards/aws-foundational-security-best-practices/v/1.0.0}', '{StandardsArn: arn:aws:securityhub:::standards/aws-foundational-security-best-practices/v/1.0.0}', {StandardsArn: arn:aws:securityhub:::standards/aws-foundational-security-best-practices/v/1.0.0}', {StandardsArn: arn:aws:securityhub:::standards/aws-foundational-security-best-practices/v/1.0.0}'],
                 'Compliance': ['{Status: FAILED}', '{Status: FAILED}', '{Status: FAILED}', '{Status: FAILED}', '{Status: FAILED}', '{Status: FAILED}', '{Status: FAILED}', '{Status: FAILED}'], 
                 'WorkflowState': [ 'NEW', 'NEW', 'NEW', 'NEW', 'NEW', 'NEW', 'NEW', 'NEW' ]})

As a final output I want to filter “generatorid” that contain “best-practice” plus the ones that contain “guardduty” and join both by “awsaccountid”.

So since my data frame has two awsaccountid and in generatorid I have 4 unique rows that contain "best-practice" for the awsaccountid 961225000000.0 and one that contains "guardduty" for aws accountid 961225000000.0 and 1 unique value for "best-practice" for the awsaccountid 971225000000.0, the final csv should output only 6 rows and its outputing the original dataset.

What I coded so far was:

pd.Series(["ProductArn", "GeneratorId", "Types", "Severity","Title", "ProductFields","Compliance","WorkflowState" ], dtype="string")
pd.Series(["ProductArn", "GeneratorId", "Types", "Severity","Title", "ProductFields","Compliance","WorkflowState"], dtype=pd.StringDtype())

df['AwsAccountId'] = df['AwsAccountId'].apply(np.int64)

df.groupby(['AwsAccountId']).filter(lambda gr: gr.GeneratorId.str.contains("best-practice","guardduty").any())

but this groupby is not outputing what I need

  In [3]: iwantthis + plus the rest of the other columns & their values: ProductArn, Types, Severity, Title, ProductFields, Compliance, WorkflowState 
  Out[3]:
     AwsAccountId  GeneratorId                                                    
  0  961225000000 aws-foundational-security-best-practices/v/1.0.0/SecretsManager.4                  
  1  961225000000 aws-foundational-security-best-practices/v/1.0.0/EC2.6
  2  961225000000 aws-foundational-security-best-practices/v/1.0.0/S3.4
  3  961225000000 aws-foundational-security-best-practices/v/1.0.0/S3.5
  4  961225000000 aws-foundational-guardduty- practices/v/1.0.0/SecretsManager.4
  5  971225000000 aws-foundational-security-best-practices/v/1.0.0/S3.5

Can someone help?

Thank you.

Please include a _small_ subset of your data as a __copyable__ piece of code that can be used for testing as well as your expected output for the __provided__ data. See [MRE - Minimal, Reproducible, Example](https://stackoverflow.com/help/minimal-reproducible-example), and [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391/15497888) for more information. — Henry Ecker, Aug 14 '21 at 18:07
To recreate your DataFrame we would have to transcribe the image of your DataFrame to be able to run your code block. Whereas you could run `df.head().to_dict()` and immediately create a reproduceable constructor. Good general rule of thumb is to ensure your code block can be copied into a clean workspace and produce the type of results you're looking to change. — Henry Ecker, Aug 14 '21 at 18:09
Thanks Henry Ecker for the tips ;). I think like this is better! — Adriana Ferro, Aug 14 '21 at 18:43

Oskar_U · Answer 1 · 2021-08-19T16:57:36.690

updated answer:

please note, that provided sample contains indeed 2 unique "AwsAccountId" but with different distribution of features you're looking for (check for typos in the sample):

# create temporary columns with appropriate flags and display pivot table
df.loc[
    df["GeneratorId"].str.contains("best-practice"), 'which one'
] = 'best-practice'
df.loc[
    df["GeneratorId"].str.contains("guardduty"), 'which one'
] = 'guardduty'
df.pivot_table(
    index='AwsAccountId', 
    columns='which one', 
    values='WorkflowState', 
    aggfunc='count', fill_value=0)

which one       best-practice  guardduty
AwsAccountId                            
961225000000.0              3          1
971225000000.0              2          0

Regarding "GeneratorId" being or not being string - let's check types of data per columns first with:

df.dtypes

AwsAccountId     float64
GeneratorId       object
ProductArn        object
Types             object
Severity          object
Title             object
ProductFields     object
Compliance        object
WorkflowState     object
dtype: object

dtype "object", if data is homogenous, means most likely just string, so I would not expect this one to be the issue with grouping or filtering.

Let's try filtering the dataframe, note using pipe character | for providing alternatives for the str.contains() function as this function requires pattern to search and not freely passed words to search for (see pandas doc)

df[
    df["GeneratorId"].str.contains("best-practice|guardduty")
]

     AwsAccountId                                        GeneratorId  ...        Compliance WorkflowState
1  961225000000.0  aws-foundational-security-best-practices/v/1.0...  ...  {Status: FAILED}           NEW
2  961225000000.0  aws-foundational-security-best-practices/v/1.0...  ...  {Status: FAILED}           NEW
3  961225000000.0  aws-foundational-security-best-practices/v/1.0...  ...  {Status: FAILED}           NEW
4  961225000000.0  aws-foundational-guardduty-practices/v/1.0.0/S...  ...  {Status: FAILED}           NEW
6  971225000000.0  aws-foundational-security-best-practices/v/1.0...  ...  {Status: FAILED}           NEW
7  971225000000.0  aws-foundational-security-best-practices/v/1.0...  ...  {Status: FAILED}           NEW

[6 rows x 9 columns]

Note how without using groupby nor filter you get all 6 entries you were looking for.

If you indeed want to further refine the dataset and drop duplicates try the following:

df[
    df["GeneratorId"].str.contains("best-practice|guardduty")
].drop_duplicates()

drop_duplicates()
     AwsAccountId                                        GeneratorId  ...        Compliance WorkflowState
1  961225000000.0  aws-foundational-security-best-practices/v/1.0...  ...  {Status: FAILED}           NEW
2  961225000000.0  aws-foundational-security-best-practices/v/1.0...  ...  {Status: FAILED}           NEW
3  961225000000.0  aws-foundational-security-best-practices/v/1.0...  ...  {Status: FAILED}           NEW
4  961225000000.0  aws-foundational-guardduty-practices/v/1.0.0/S...  ...  {Status: FAILED}           NEW
6  971225000000.0  aws-foundational-security-best-practices/v/1.0...  ...  {Status: FAILED}           NEW

[5 rows x 9 columns]

previous answer:

is groupby even needed in your request? As I understand expected result is just a subset of dataframe with certain conditions:

df['AwsAccountId'] = df['AwsAccountId'].astype(int)
outcome = df[df['GeneratorId'].str.contains('best-practice|guardduty')][['AwsAccountId', 'GeneratorId']]

Oskar_U, I need the groupby cause I might have more that one AwsAccountId, should have put it in my question, but wanted to keep a small subset of my data. — Adriana Ferro, Aug 16 '21 at 13:50
And I think my problem is that the code that I run to transform the object datatype of 'GeneratorId' into string is not working, therefore I can't filter it — Adriana Ferro, Aug 16 '21 at 13:54
just edited my question a bit and @Oskar_U, just tried your code and I'm still getting the same number of rows of my original dataset. — Adriana Ferro, Aug 16 '21 at 16:09

Pandas GroupBy ID and filtering

1 Answers1

updated answer:

previous answer: