
I have a tab-separated data frame that looks like this, for example:

       A                                 B                      C
gene1  AHX21832.1                        EEL39984.1,ARO60330.1  EEL39984.1
gene2  EEL39984.1,ARO60330.1             ARO60330.1             ARO60330.1
gene3  AYF09030.1,EEL37774.1,AQY42173.1  AQY42173.1             AQY42173.1

The following script works well on a list:

values = ["AHX21832.1", "EEL39984.1,ARO60330.1", "AYF09030.1,EEL37774.1,AQY42173.1"]

script

How can I apply this script to my pandas data frame, given that pandas has no re.findall?

Mayank Porwal
  • Here, the data frame is garbled. Here is an example of my data: https://drive.google.com/open?id=1Y8x0WQdAbGGcqfOeRsLUYvi8SVZsq72z Some cells contain several comma-separated IDs, such as "EEL39984.1,ARO60330.1". I would like to replace these with their sum. – Dilfuza Djamalova May 20 '20 at 22:01
  • Kindly post your expected output, in dataframe form. – sammywemmy May 20 '20 at 22:29
  • use this as a guide on how to post questions on stack overflow : [guide](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – sammywemmy May 20 '20 at 22:33
  • here is the link https://docs.google.com/spreadsheets/d/1TGBMOY121gyZcYUc5Gc9CiR8V0rUT-57EZ_-pVOXHeQ/edit?usp=sharing – Dilfuza Djamalova May 20 '20 at 22:34

1 Answer


Take a look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.findall.html. Series.str.findall applies re.findall to every element of a Series, so you can do the equivalent on a data frame by going through it column by column.

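# df.items() replaces df.iteritems(), which was removed in pandas 2.0
for column, data in df.items():
    # raw string so that \. and \d reach the regex engine unchanged
    res = data.str.findall(r"[A-Z0-9]\.(\d+)")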
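The loop above overwrites res on every pass, so only the last column's matches survive. A minimal sketch of one way to keep the results for every column, assuming the frame from the question with the gene names as the index (the variable names pattern and matches are made up for illustration):

```python
import pandas as pd

# Example frame rebuilt from the question; using the gene names as the
# index is an assumption about how the file was read.
df = pd.DataFrame(
    {
        "A": ["AHX21832.1", "EEL39984.1,ARO60330.1", "AYF09030.1,EEL37774.1,AQY42173.1"],
        "B": ["EEL39984.1,ARO60330.1", "ARO60330.1", "AQY42173.1"],
        "C": ["EEL39984.1", "ARO60330.1", "AQY42173.1"],
    },
    index=["gene1", "gene2", "gene3"],
)

pattern = r"[A-Z0-9]\.(\d+)"

# Run Series.str.findall on each column and collect the per-cell lists
# into a new frame with the same index and column names.
matches = pd.DataFrame({col: data.str.findall(pattern) for col, data in df.items()})

print(matches)
```

Each cell of matches holds the list that re.findall would return for the corresponding cell of df. Because the pattern contains a capture group, that list holds only the digits after the dot.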

So for the code you posted in your repl.it link, you could get the same results by doing:

import pandas as pd

values = pd.Series(["AHX21832.1",
"EEL39984.1,ARO60330.1",
"AYF09030.1,EEL37774.1,AQY42173.1"])

res = values.str.findall(r"[A-Z0-9]\.(\d+)")

for x in res:
    print("Found", x)
print("total", res.shape[0])
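If the "sum" mentioned in the comments means the number of IDs in each cell (an assumption, since the question does not show the expected output), a rough sketch is to take the length of each list that findall returns. The name counts is made up for illustration:

```python
import pandas as pd

values = pd.Series(["AHX21832.1",
                    "EEL39984.1,ARO60330.1",
                    "AYF09030.1,EEL37774.1,AQY42173.1"])

# findall returns one list per element; .str.len() gives each list's length,
# which here is the number of IDs in the cell (assumed meaning of "sum").
counts = values.str.findall(r"[A-Z0-9]\.(\d+)").str.len()

print(counts.tolist())
```

The same expression can be applied to each column of a data frame in the way the first snippet loops over columns.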
hostingutilities.com