3

I am struggling to find the sum of count column based on subtring present in a column Name. The substring should co exist with the other multiple values present in another column that is Error Name. If substring (e.g. Ehsan) matches and another column i.e. Error Name has those multiple values (Device and Line Error) then i would some the count in Count Column. Remember I have to sum only those count which has substring Ehsan in Name and Device and Line Error in Error Name Below is my Raw Data:

enter image description here

And my output should look like this:

enter image description here

Sunderam Dubey
  • 1
  • 11
  • 20
  • 40
Filbadeha
  • 389
  • 2
  • 17

2 Answers2

2

We can use str.contains and sum here:

total = df[df["Name"].str.contains(r'\bEhsan\b', flags=re.I, na=False, regex=True)]["Count"].sum()
Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • I have some NA/NaN values as well in my actual data and giving this error when i applied your code. ValueError: Cannot mask with non-boolean array containing NA / NaN values – Filbadeha Jun 19 '22 at 11:06
  • 1
    @Ehsan125 Try adding the `na=False` flag to the `str.contains` call. – Tim Biegeleisen Jun 19 '22 at 11:12
  • Thanks it worked , what if i want to show each Different occurrence of Ehsan along Device and Line Error and report the count separately in Dataframe? Can you please then modify your code? – Filbadeha Jun 19 '22 at 11:16
1

After Editing The Question: You need to select rows that contain Ehsan and then use pandas.groupby on the result dataframe like below:

mask_name = df['Name'].str.contains(r'.*(?:Ehsan).*')
mask_err = df['Error Name'].str.contains(r'(?:\bLine Error\b|\bDevice\b)')
df = df[mask_name & mask_err]
df.groupby(['Name', 'Error Name'])['Count'].sum().reset_index()

Before Editing The Question: You can write a mask for columns Name and Error Name with regex then select rows that have True in two masks and sum Count for those rows with pandas.loc and pandas.sum like below:

mask_name = df['Name'].str.contains(r'.*(?:Ehsan).*')
mask_err = df['Error Name'].str.contains(r'(?:\bLine Error\b|\bDevice\b)')
df.loc[mask_name & mask_err, 'Count'].sum()
I'mahdi
  • 23,382
  • 5
  • 22
  • 30
  • Thanks it worked , what if i want to show each Different occurrence of Ehsan in resulted Dataframe along Device and Line Error and report the count separately in Dataframe? Can you please then modify your code? – Filbadeha Jun 19 '22 at 11:28
  • 1
    @Ehsan125, welcome, Can you add desired output to your question? – I'mahdi Jun 19 '22 at 11:29
  • i have edited the source data and output desired data, please remember now Name column has duplicate name and in final output it would be single occurrence based on Error Name Column. If Error Name would change then that may occur again. – Filbadeha Jun 19 '22 at 11:42
  • 1
    Thanks it worked , but tell me what first is doing in agg function, what is its use? And one more thing in your first solution you have to keep the space of or condition like this Line Error|Device can you please incorporate this as well in the new solution that you provided please – Filbadeha Jun 19 '22 at 12:02
  • 1
    In `agg` after groupby we can use a function on other columns like `sum`, `first`, `last`, ... but in your problem, I think you don't need `agg`. You need groupby on two columns the sum on `count`. read this [link](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html) or [this](https://stackoverflow.com/questions/43172970/python-pandas-groupby-aggregate-on-multiple-columns-then-pivot) – I'mahdi Jun 19 '22 at 12:13