Regular expression to replace strings separated by commas with their sum in a pandas dataframe

Question

I have a tab-separated data frame which looks like (for example):

       A                                 B                      C
gene1  AHX21832.1                        EEL39984.1,ARO60330.1  EEL39984.1
gene2  EEL39984.1,ARO60330.1             ARO60330.1             ARO60330.1
gene3  AYF09030.1,EEL37774.1,AQY42173.1  AQY42173.1             AQY42173.1

The following script works well on a list:

values = ["AHX21832.1", "EEL39984.1,ARO60330.1", "AYF09030.1,EEL37774.1,AQY42173.1"]

script

How can I apply this script to my pandas data frame, given that there is no re.findall in pandas?

The data frame got garbled here. An example of my data is at https://drive.google.com/open?id=1Y8x0WQdAbGGcqfOeRsLUYvi8SVZsq72z. Some cells contain "EEL39984.1,ARO60330.1", which is separated by a comma. I would like to replace it with their sum. – Dilfuza Djamalova, May 20 '20 at 22:01
Use this as a guide on how to post questions on Stack Overflow: [guide](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – sammywemmy, May 20 '20 at 22:33
Here is the link: https://docs.google.com/spreadsheets/d/1TGBMOY121gyZcYUc5Gc9CiR8V0rUT-57EZ_-pVOXHeQ/edit?usp=sharing – Dilfuza Djamalova, May 20 '20 at 22:34

hostingutilities.com · Answer 1 · 2020-05-20T23:10:56.157

Take a look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.findall.html. It looks like it is possible to do the equivalent of re.findall on each column of a dataframe.

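# Apply the pattern to every column of the data frame
for column, data in df.items():  # items() replaces iteritems(), which was removed in pandas 2.0
    res = data.str.findall(r"[A-Z0-9]\.(\d+)")  # raw string keeps the regex escapes intact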

So for the code you posted in your repl.it link, you could get the same results by doing:

import pandas as pd

values = pd.Series(["AHX21832.1",
                    "EEL39984.1,ARO60330.1",
                    "AYF09030.1,EEL37774.1,AQY42173.1"])

# One list of captured digits per element of the Series
res = values.str.findall(r"[A-Z0-9]\.(\d+)")

for x in res:
    print("Found", x)
print("total", res.shape[0])  # number of elements in the Series
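
If the goal is a count of matches per cell rather than the matches themselves, the lists returned by findall can be measured directly. The following is a minimal sketch under that assumption; the names matches, per_cell and total are illustrative and do not come from the original script.

```python
import pandas as pd

values = pd.Series(["AHX21832.1",
                    "EEL39984.1,ARO60330.1",
                    "AYF09030.1,EEL37774.1,AQY42173.1"])

# findall returns one list of captured digits per cell
matches = values.str.findall(r"[A-Z0-9]\.(\d+)")

# The length of each list is the number of matches in that cell
per_cell = matches.map(len)

# Summing the per-cell counts gives the number of matches overall
total = int(per_cell.sum())

print(per_cell.tolist())
print("total", total)
```

Note that this counts matches across all cells, whereas res.shape[0] above reports how many cells the Series has.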

edited May 20 '20 at 23:10

answered May 20 '20 at 22:56

hostingutilities.com


I could use res = data.str.findall("[A-Z0-9]\.(\d+)") to find the pattern, but I still cannot replace it with its sum – Dilfuza Djamalova May 20 '20 at 23:15
Is this what you're wanting: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.count.html#pandas-series-str-count – hostingutilities.com May 20 '20 at 23:28
No. The values should be replaced by their occurrence count in each cell – Dilfuza Djamalova May 20 '20 at 23:31
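
One possible reading of this request is that every cell should be replaced by the number of identifiers it contains. The sketch below assumes that reading and rebuilds the example frame from the question by hand (the index and the column layout are assumptions). It applies Series.str.count column by column, so the result has the same shape as the input.

```python
import pandas as pd

# Hypothetical reconstruction of the example frame shown in the question
df = pd.DataFrame(
    {
        "A": ["AHX21832.1",
              "EEL39984.1,ARO60330.1",
              "AYF09030.1,EEL37774.1,AQY42173.1"],
        "B": ["EEL39984.1,ARO60330.1", "ARO60330.1", "AQY42173.1"],
        "C": ["EEL39984.1", "ARO60330.1", "AQY42173.1"],
    },
    index=["gene1", "gene2", "gene3"],
)

# Count non-overlapping pattern matches in each cell, one column at a time
counts = df.apply(lambda col: col.str.count(r"[A-Z0-9]\.\d+"))

print(counts)
```

Cells holding missing values would come back as NaN rather than 0, so a fillna(0) step may be needed on real data.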
