
I have a main pandas DataFrame that contains 3 columns: pagepath, pageviews, pagedate. I would like to detect the rows that have the same pagepath and make a separate pandas DataFrame for each group of rows sharing a pagepath.

There will be a lot of groups; in the image it is possible to observe /news/AAL and /news/ZVO, but if we dive into the list in more detail, there are many more groups. I could not think of a way to find rows with the same pagepath.

Thanks for your help.

Burak
  • **[Don't Post Screenshots](https://meta.stackoverflow.com/questions/303812/)**. Always provide a [mre], with **code, data, errors, current output, and expected output, as [formatted text](https://stackoverflow.com/help/formatting)**. It's likely the question will be down-voted and closed. You're discouraging assistance because no one wants to retype your data or code, and screenshots are often illegible. [edit] the question and **add text**. Please see [How to provide a reproducible copy of your DataFrame using `df.head(15).to_clipboard(sep=',')`](https://stackoverflow.com/questions/52413246). – Trenton McKinney Feb 06 '21 at 16:27
  • okay thank you, will do – Burak Feb 06 '21 at 16:49

1 Answer


You are on the right path. You just need to create a regex capturing group with the same pattern you used to filter the rows, wrapping the desired part in `()`. Use `str.extract` with `expand=False` to return the first (and only) capturing group as a Series. Then pass this Series to the `groupby` function. `groupby` yields a separate pandas DataFrame for each group, which you can access through a for loop or via `groupByObj.get_group(groupName)`.

File sample.csv used as input

ga:pagePath,ga:pageviews,ga:pageDate
/news/AAL/1004553,2958,1612569600
/news/AAL/1004553,9158,1612569600
/news/BLX/2004553,9258,1612569600
...
...
/news/JKK/1005553,4558,1612569600
/news/ZZP/2034553,7338,1612569600
/news/ZZP/6004553,9458,1612569600
/news/ZZP/4004553,8858,1612569600

import pandas as pd

df = pd.read_csv("sample.csv")
print(df)

regex = r"^/news/([A-Z]{3})/.*"

groups = df["ga:pagePath"].str.extract(regex, expand=False)
page_groups = df.groupby(groups)

for groupName, dfGroup in page_groups:
    print(f"------- {groupName} -------")
    print(dfGroup)

Output from page_groups

------- AAL -------
         ga:pagePath   ga:pageviews   ga:pageDate
0  /news/AAL/1004553           2958    1612569600
1  /news/AAL/1004553           9158    1612569600
------- BLX -------
         ga:pagePath   ga:pageviews   ga:pageDate
2  /news/BLX/2004553           9258    1612569600
...
...
------- JKK -------
          ga:pagePath   ga:pageviews   ga:pageDate
13  /news/JKK/2009553           1458    1612569600
14  /news/JKK/1005553           4558    1612569600
------- ZZP -------
          ga:pagePath   ga:pageviews   ga:pageDate
15  /news/ZZP/2034553           7338    1612569600
16  /news/ZZP/6004553           9458    1612569600
17  /news/ZZP/4004553           8858    1612569600
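If you only need one group, or want every group materialized as its own DataFrame, `get_group` and a dict comprehension avoid the loop. A minimal sketch, using an inline copy of a few sample rows (the three-row CSV string here is an assumption, abbreviated from the sample above):

```python
import pandas as pd
from io import StringIO

# Abbreviated inline copy of sample.csv for a self-contained example
csv = """ga:pagePath,ga:pageviews,ga:pageDate
/news/AAL/1004553,2958,1612569600
/news/AAL/1004553,9158,1612569600
/news/ZZP/2034553,7338,1612569600
"""
df = pd.read_csv(StringIO(csv))

# Same capturing group as in the answer: pull the 3-letter ticker out of the path
groups = df["ga:pagePath"].str.extract(r"^/news/([A-Z]{3})/.*", expand=False)
page_groups = df.groupby(groups)

# Grab a single group's DataFrame directly, without iterating
aal = page_groups.get_group("AAL")
print(aal)  # the two /news/AAL rows

# Or materialize every group into its own DataFrame at once
frames = {name: grp for name, grp in page_groups}
print(list(frames))  # ['AAL', 'ZZP']
```

`get_group` raises a `KeyError` if the name does not exist, so it is safest when you already know the group names (e.g. from `page_groups.groups.keys()`).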
n1colas.m