I have a dataframe, and one of its columns looks roughly like the example below. Is there any way to rename the rows? They should be renamed to psPARP8, psEXOC8, psTMEM128, psCFHR3, where "ps" stands for pseudogene and the term in parentheses is the code for that pseudogene. I would highly appreciate it if anyone could write a Python function, or suggest an alternative, to perform this task.

import pandas as pd

d = {'gene_final': ["poly(ADP-ribose) polymerase family member 8 (PARP8) pseudogene",
                    "exocyst complex component 8 (EXOC8) pseudogene",
                    "transmembrane protein 128 (TMEM128) pseudogene",
                    "complement factor H related 3 (CFHR3) pseudogene",
                    "mitochondrially encoded NADH 4L dehydrogenase (MT-ND4L) pseudogene",
                    "relaxin family peptide/INSL5 receptor 4 (RXFP4 ) pseudogene",
                    "nasGBP7and GBP2"]}

df = pd.DataFrame(data=d)

The desired output should look like this:

gene_final
-----------
psPARP8
psEXOC8
psTMEM128
psCFHR3
psMT-ND4L
psRXFP4
nasGBP2
  • Did you try `df.rename(columns={'orig_name': 'new_name', 'orig_name_2': 'new_name_2'})`? – Drecker Feb 02 '22 at 10:13
  • What do you mean by "the rows of column gene_final should be renamed"? Do you mean the index of the dataframe should have this name? If so, how would this only affect the column `gene_final`, given that the index applies to the entire dataframe? I suggest providing a valid sample input and a description of the desired output. See also [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Mr. T Feb 02 '22 at 10:18
  • Hi, unfortunately this will not work, because it involves a lot of manual work. In fact, my dataframe consists of thousands of rows. I have been trying to automate this but couldn't succeed. – Alok Chauhan Feb 02 '22 at 10:20
  • Yes, the index name applies to the entire dataframe. Sorry for the confusion. – Alok Chauhan Feb 02 '22 at 10:22
  • So, you want to extract for each row the content of the last parenthesis (is it always the last? will it always be followed by the term "pseudogene"?), add "ps" before this name, and rename the index of the dataframe with it? – Mr. T Feb 02 '22 at 10:25
  • Yes, exactly. I have also updated my post; please check the desired output that I want. – Alok Chauhan Feb 02 '22 at 10:38

2 Answers

1
import re

import pandas as pd

# build dataframe
df = pd.DataFrame({'gene_final': ["poly(ADP-ribose) polymerase family member 8 (PARP8) pseudogene",
                                  "exocyst complex component 8 (EXOC8) pseudogene",
                                  "transmembrane protein 128 (TMEM128) pseudogene",
                                  "complement factor H related 3 (CFHR3) pseudogene",
                                  "mitochondrially encoded NADH 4L dehydrogenase (MT-ND4L) pseudogene",
                                  "relaxin family peptide/INSL5 receptor 4 (RXFP4 ) pseudogene"]})


def extract_name(s):
    """Helper function to extract the ps name."""
    # find the token between ' (' and ')'; one space before ')' is tolerated
    s = re.findall(r"\s\((\S*)\s?\)", s)[0]
    s = f"ps{s}"  # add ps to string
    return s

# apply function extract_name() to each row
df['gene_final'] = df['gene_final'].apply(extract_name)
print(df)
>   gene_final
> 0    psPARP8
> 1    psEXOC8
> 2  psTMEM128
> 3    psCFHR3
> 4  psMT-ND4L
> 5    psRXFP4
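
Regarding the IndexError reported in the comments below: indexing `[0]` fails on any row where the pattern finds nothing. A minimal sketch of one way around it, assuming non-matching rows should simply be left unchanged (the name `extract_name_safe` and that fallback are my assumptions, not something stated in the question):

```python
import re

import pandas as pd

# Same pattern as in the answer: a token between " (" and ")",
# optionally followed by one space before the closing parenthesis.
PATTERN = re.compile(r"\s\((\S*)\s?\)")


def extract_name_safe(s):
    """Return 'ps' + symbol if the pattern matches, otherwise s unchanged."""
    match = PATTERN.search(s)
    if match is None:
        return s  # assumption: keep rows that do not fit the pattern as they are
    return f"ps{match.group(1)}"


df = pd.DataFrame({'gene_final': [
    "exocyst complex component 8 (EXOC8) pseudogene",
    "relaxin family peptide/INSL5 receptor 4 (RXFP4 ) pseudogene",
    "nasGBP7and GBP2",
]})

df['gene_final'] = df['gene_final'].apply(extract_name_safe)
print(df['gene_final'].tolist())
```

Note that this leaves "nasGBP7and GBP2" untouched. The question asks for "nasGBP2" there, but the rule behind that is not stated, so it would need its own handling.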

psychOle
  • Thank you @psychOle. This function seems to work. However, I am going to ask you one more favour :). When I apply this function to my entire dataframe, I get an error (IndexError: list index out of range), and I know why: there are other rows that do not match the desired regex format. Any ideas how to avoid this? – Alok Chauhan Feb 02 '22 at 11:53
  • Please update your question and include these cases in your example dataframe. – psychOle Feb 02 '22 at 13:12
  • I edited my example dataframe (added the last three rows). Thanks in advance. – Alok Chauhan Feb 02 '22 at 16:20
  • I updated my answer. – psychOle Feb 02 '22 at 19:13
0

I think you are asking about index names (row labels). This is how you set the row names when building a DataFrame:

import pandas as pd

df = pd.DataFrame({'A': [11, 21, 31],
                   'B': [12, 22, 32],
                   'C': [13, 23, 33]},
                  index=['ONE', 'TWO', 'THREE'])

print(df)

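You can also change the row names after building the DataFrame, like this. Note that `rename` returns a new DataFrame; the original `df` is unchanged, as the second print shows: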

df_new = df.rename(columns={'A': 'Col_1'}, index={'ONE': 'Row_1'})
print(df_new)
#        Col_1   B   C
# Row_1     11  12  13
# TWO       21  22  23
# THREE     31  32  33

print(df)
#         A   B   C
# ONE    11  12  13
# TWO    21  22  23
# THREE  31  32  33
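
For thousands of rows, writing the mapping by hand is not practical, but `rename` also accepts a function for `index`, which is applied to every row label. A rough sketch of how that might look for the data in the question, assuming the gene descriptions are moved into the index first (the helper `to_ps_name` and the `score` column are made up for illustration):

```python
import re

import pandas as pd

# Two rows from the question; the 'score' column is hypothetical.
df = pd.DataFrame({
    'gene_final': ["poly(ADP-ribose) polymerase family member 8 (PARP8) pseudogene",
                   "complement factor H related 3 (CFHR3) pseudogene"],
    'score': [1, 2],
})


def to_ps_name(label):
    """Hypothetical helper: 'ps' + symbol found in ' (...)', else the label unchanged."""
    match = re.search(r"\s\((\S*)\s?\)", label)
    return f"ps{match.group(1)}" if match else label


# Move gene_final into the index, then relabel every row with the function.
df_indexed = df.set_index('gene_final').rename(index=to_ps_name)
print(df_indexed.index.tolist())
```

As before, `rename` returns a new DataFrame, so the result has to be assigned to a name.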
Sarim Sikander