You are on the right track. You just need to turn the pattern you used to filter the rows into a capturing group by wrapping the part you want to extract in (). Use str.extract with expand=False to return the first (and only) capturing group as a Series, then pass that Series to groupby. The resulting GroupBy object holds one sub-DataFrame per group, which you can iterate over with a for loop or fetch directly with groupByObj.get_group(groupName).
File sample.csv used as input
ga:pagePath, ga:pageviews, ga:pageDate
/news/AAL/1004553, 2958, 1612569600
/news/AAL/1004553, 9158, 1612569600
/news/BLX/2004553, 9258, 1612569600
...
...
/news/JKK/1005553, 4558, 1612569600
/news/ZZP/2034553, 7338, 1612569600
/news/ZZP/6004553, 9458, 1612569600
/news/ZZP/4004553, 8858, 1612569600
import pandas as pd

# skipinitialspace strips the blanks after the commas in the header row
df = pd.read_csv("sample.csv", skipinitialspace=True)
print(df)

# Same pattern used to filter the rows, with () capturing the three-letter code
regex = r"^/news/([A-Z]{3})/.*"
groups = df["ga:pagePath"].str.extract(regex, expand=False)

page_groups = df.groupby(groups)
for groupName, dfGroup in page_groups:
    print(f"------- {groupName} -------")
    print(dfGroup)
Output from page_groups
------- AAL -------
ga:pagePath ga:pageviews ga:pageDate
0 /news/AAL/1004553 2958 1612569600
1 /news/AAL/1004553 9158 1612569600
------- BLX -------
ga:pagePath ga:pageviews ga:pageDate
2 /news/BLX/2004553 9258 1612569600
...
...
------- JKK -------
ga:pagePath ga:pageviews ga:pageDate
13 /news/JKK/2009553 1458 1612569600
14 /news/JKK/1005553 4558 1612569600
------- ZZP -------
ga:pagePath ga:pageviews ga:pageDate
15 /news/ZZP/2034553 7338 1612569600
16 /news/ZZP/6004553 9458 1612569600
17 /news/ZZP/4004553 8858 1612569600
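If you only need one group rather than all of them, get_group avoids the loop entirely. A minimal sketch, using a small inline DataFrame with made-up values in place of sample.csv:

```python
import pandas as pd

# Hypothetical data mirroring the shape of sample.csv
df = pd.DataFrame({
    "ga:pagePath": ["/news/AAL/1004553", "/news/AAL/1004553", "/news/BLX/2004553"],
    "ga:pageviews": [2958, 9158, 9258],
    "ga:pageDate": [1612569600, 1612569600, 1612569600],
})

# Extract the three-letter code and group by it, as above
groups = df["ga:pagePath"].str.extract(r"^/news/([A-Z]{3})/.*", expand=False)
page_groups = df.groupby(groups)

# Fetch a single group's rows directly, no loop needed
aal = page_groups.get_group("AAL")
print(aal)
```

get_group raises a KeyError if the name does not exist, so it is best used when you already know which codes are present.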