How to find/filter/combine based on common prefix in rows and columns with use of python/pandas?

Question

I'm new to coding and having a hard time expressing/searching for the correct terms to help me along with this task. In my work I get some pretty large excel-files from people out in the field monitoring birds. The results need to be prepared for databases, reports, tables and more. I was hoping to use Python to automate some tasks for this.

How can I use Python (pandas?) to find certain rows/columns based on a common name/ID but with a unique suffix , and aggregate/sum the results that belongs together under that common name? As an example in the table provided I need get all the results from sub-localities e.g. AA3_f, AA3_lf and AA3_s expressed as the sum (total of gulls for each species) of the subs in a new row for the main Locality AA3.

Can someone please provide some code for this task, or help me in some other way? I have searched and watched many tutorials on python, numpy, pandas and also matplotlib .. still clueless on how to set this up

any help appreciated Thanks!

Update:

@Harsh Nagouda, thanks for your reply. I tried your example using groupby function, but I having trouble dividing into correct groups. The "Locality" column has only unique values/ID because they all have a suffix (they are sub categories).

I tried to solve this by slicing the strings:

eng.Locality.str.slice(0,4,1)

i managed to slice off the suffices so that the remainders = AA3_ , AA4_ and so on. Then i tried to do this slicing in the groupby function. That failed. Then I tried to slice using pandas.Dataframe.apply(). That failed as well.

eng["Locality"].apply(eng.Locality.str.slice(0,4,1))
sum = eng.groupby(["Locality"].str.slice(0,4,1)).sum()

Any more help out there? As you can see above - I need it :-)

Welcome to Stack Overflow! Please include any relevant information [as text directly into your question](https://stackoverflow.com/editing-help), do not link or embed external images of source code or data. Images make it difficult to efficiently assist you as they cannot be copied and offer poor usability as they cannot be searched. See: [Why not upload images of code/errors when asking a question?](https://meta.stackoverflow.com/q/285551/15497888) — Henry Ecker, Jun 08 '21 at 23:47
If you need assistance formatting a small sample of your DataFrame as a copyable piece of code for SO see [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391/15497888). — Henry Ecker, Jun 08 '21 at 23:47

score 0 · Answer 1 · answered Jun 09 '21 at 00:33

In your case, the pd.groupby option seems to be a good fit for the problem. The groupby function does exactly what it means, it groups parts of the dataframe you like it to.

Since you mentioned a case based on grouping by localities and finding the sum of those values, this snippet should help you out:

sum = eng.groupby(["Locality"]).sum()

Additional commands and sorting styles can be found here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

Lost_my_Password · Answer 2 · 2021-06-09T22:00:50.207

0

I finally figured out a way to get it done. Maybe not the smoothest way, but at least I get the end result I need:

Edited the Locality-ID to remove suffix:eng["Locality"]=eng["Locality].str.slice(0,4,1)
Used the groupby function:sum = eng.groupby(["Locality"]).sum()
End result: Table

edited Jun 09 '21 at 22:00

answered Jun 09 '21 at 17:34

Lost_my_Password

1
3

Guess I could use regex to cut the Locality-IDs even cleaner... – Lost_my_Password Jun 10 '21 at 23:53

How to find/filter/combine based on common prefix in rows and columns with use of python/pandas?

2 Answers2