I have a FASTA file containing around 900k protein sequences. The first three records look like this:

>NP_000011.2 serine/threonine-protein kinase receptor R3 precursor [Homo sapiens]
MTLGSPRKGLLMLLMALVTQGDPVKPSRGPLVTCTCESPHCKGPTCRGAWCTVVLVREEGRHPQEHRGCGNLHRELCRGR
PTEFVNHYCCDSHLCNHNVSLVLEATQPPSEQPGTDGQLALILGPVLALLALVALGVLGLWHVRRRQEKQRGLHSELGES
>NP_000012.1 presenilin-1 isoform I-467 [Homo sapiens]
MTELPAPLSYFQNAQMSEDNHLSNTVRSQNDNRERQEHNDRRSLGHPEPLSNGRPQGNSRQVVEQDEEEDEELTLKYGAK
HVIMLFVPVTLCMVVVVATIKSVSFYTRKDGQLIYTPFTEDTETVGQRALHSILNAAIMISVIVVMTILLVVLYKYRCYK
>NP_000013.2 adenosine deaminase isoform 1 [Homo sapiens]
MAQTPAFDKPKVELHVHLDGSIKPETILYYGRRRGIALPANTAEGLLNVIGMDKPLTLPDFLAKFDYYMPAIAGCREAIK
RIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVEPIPWNQAEGDLTPDEVVALVGQGLQEGERDFGVKARSILCCMRHQPN

I want to load them into a dataframe with proper column names, like this:

ID              name                             sapiens        sequence  
>NP_000011.2    serine/threonine-protein kinase  [Homo sapiens] MTLGSPRKGLLMLLMALVTQGDPVKPSRGPLVTCTCESPHCKGPTCRGAWCTVVLVREEGRHPQEHRGCGNLHRELCRGRPTEFVNHYCCDSHLCNHNVSLVLEATQPPSEQPGTDGQLALILGPVLALLALVALGVLGLWHVRRRQEKQRGLHSELGES
>NP_000012.1    presenilin-1 isoform I-467       [Homo sapiens] MTELPAPLSYFQNAQMSEDNHLSNTVRSQNDNRERQEHNDRRSLGHPEPLSNGRPQGNSRQVVEQDEEEDEELTLKYGAKHVIMLFVPVTLCMVVVVATIKSVSFYTRKDGQLIYTPFTEDTETVGQRALHSILNAAIMISVIVVMTILLVVLYKYRCYK
>NP_000013.2    adenosine deaminase isoform 1    [Homo sapiens] MAQTPAFDKPKVELHVHLDGSIKPETILYYGRRRGIALPANTAEGLLNVIGMDKPLTLPDFLAKFDYYMPAIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVEPIPWNQAEGDLTPDEVVALVGQGLQEGERDFGVKARSILCCMRHQPN

None of the methods below worked:

df = open('sample.faa','r')
lines = df.readlines()
df.close()
for index, line in enumerate(lines):
    lines[index] = line.strip()
df_result = pd.DataFrame(columns=('ID', 'name'))
i = 0
ID = ""
name = ""
for line in lines:
    if 'X' in line:
        ID = line.replace('X', "")
    else:
        name = re.sub(r']', "", line)
        df_result.loc[i] = [ID, name]
        i = i + 1

f = open('sample.faa', encoding='utf8')
df = pd.DataFrame(f)
df

data = pd.read_csv('sample.faa', sep=',')
  • https://stackoverflow.com/questions/19436789/biopython-seqio-to-pandas-dataframe has a recipe for converting a FASTA file to dataframes with BioPython but you'd have to extend that to split the title into id, name, and species (which is not hard at all per se). – tripleee Nov 03 '21 at 18:47

1 Answer

Assuming you were able to read the file as text and each record sits on a single line, you now have a list like:

lines = [
    ">NP_000011.2 serine/threonine-protein kinase receptor R3 precursor [Homo sapiens] MTLGSPRKGLLMLLMALVTQGDPVKPSRGPLVTCTCESPHCKGPTCRGAWCTVVLVREEGRHPQEHRGCGNLHRELCRGRPTEFVNHYCCDSHLCNHNVSLVLEATQPPSEQPGTDGQLALILGPVLALLALVALGVLGLWHVRRRQEKQRGLHSELGES",
    ">NP_000012.1 presenilin-1 isoform I-467 [Homo sapiens] MTELPAPLSYFQNAQMSEDNHLSNTVRSQNDNRERQEHNDRRSLGHPEPLSNGRPQGNSRQVVEQDEEEDEELTLKYGAKHVIMLFVPVTLCMVVVVATIKSVSFYTRKDGQLIYTPFTEDTETVGQRALHSILNAAIMISVIVVMTILLVVLYKYRCYK",
    ">NP_000013.2 adenosine deaminase isoform 1 [Homo sapiens] MAQTPAFDKPKVELHVHLDGSIKPETILYYGRRRGIALPANTAEGLLNVIGMDKPLTLPDFLAKFDYYMPAIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVEPIPWNQAEGDLTPDEVVALVGQGLQEGERDFGVKARSILCCMRHQPN"
]
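If the file is still in its raw form, with each sequence wrapped over several lines, a list of this shape could be built first. The following is only a minimal sketch: `join_fasta_records` is a hypothetical helper name, and the short sample stands in for the real file.

```python
import io

def join_fasta_records(handle):
    """Yield one string per FASTA record: the header line, a space,
    then the wrapped sequence lines concatenated."""
    header, chunks = None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new header closes the previous record
            if header is not None:
                yield header + " " + "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    # emit the last record
    if header is not None:
        yield header + " " + "".join(chunks)

# shortened sample records, for illustration only
sample = io.StringIO(
    ">NP_000012.1 presenilin-1 isoform I-467 [Homo sapiens]\n"
    "MTELPAPLSY\n"
    "FQNAQMSEDN\n"
    ">NP_000013.2 adenosine deaminase isoform 1 [Homo sapiens]\n"
    "MAQTPAFDKP\n"
)
lines = list(join_fasta_records(sample))
```

For the real file, `with open('sample.faa') as fh: lines = list(join_fasta_records(fh))` would take the place of the `io.StringIO` sample. Because the helper is a generator, it reads the file line by line rather than loading all of it before processing.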

To convert it into the required dataframe, you can iterate over the list and do something like:

import pandas as pd

columns = ["ID", "name", "sapiens", "sequence"]
frame = {k: [] for k in columns}

for line in lines:
    # first token is the ID, last token is the sequence,
    # everything in between is the description
    ID, *series, sequence = line.strip().split()

    # rebuild the description and split it at the opening bracket
    # (assumes every description contains " [")
    series = " ".join(series).split(" [")
    name = series[0]
    sapiens = "[" + series[1]

    for k, v in zip(columns, [ID, name, sapiens, sequence]):
        frame[k].append(v)

dataframe = pd.DataFrame(frame)

Output: (image of the resulting dataframe)
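The loop above relies on every line having an ID, a description containing ` [`, and a sequence. A line with a single token (for example a bare sequence line taken straight from the raw file) makes the starred unpacking fail with `ValueError: not enough values to unpack`. A more defensive variant might look like the sketch below. It assumes the `>ID name [species]` header layout shown in the question; `HEADER_RE` and `split_record` are hypothetical names, and the second sample record is made up to show a header without a species.

```python
import re

# Assumed header layout: ">ID name [species]"; the "[species]" part is optional.
HEADER_RE = re.compile(r"^(?P<ID>>\S+)\s*(?P<name>.*?)\s*(?P<sapiens>\[[^\]]*\])?$")

def split_record(line):
    """Split a 'header sequence' string into the four columns.
    Return None for lines without a '>' header or without a sequence."""
    # the sequence is the last whitespace-separated token
    header, _, sequence = line.strip().rpartition(" ")
    if not header.startswith(">"):
        return None
    match = HEADER_RE.match(header)
    if match is None:
        return None
    row = match.groupdict(default="")  # missing species becomes ""
    row["sequence"] = sequence
    return row

lines = [
    ">NP_000012.1 presenilin-1 isoform I-467 [Homo sapiens] MTELPAPLSY",
    ">XP_000000.0 hypothetical protein MAQT",  # made-up record, no species
    "MAQTPAFDKP",                              # bare sequence line, skipped
]
rows = [row for row in map(split_record, lines) if row is not None]
```

Passing `rows` to `pd.DataFrame(rows)` then gives the `ID`, `name`, `sapiens` and `sequence` columns. Comparing `len(rows)` with `len(lines)` shows how many lines were skipped, which helps locate records that do not match the assumed layout.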

Mr. Hobo
  • this showed the following error: ValueError: not enough values to unpack (expected at least 2, got 1) – Mohammad Alshehri Nov 03 '21 at 19:42
  • Sorry for the delay, was out of the station. I guess `ValueError` is raised at `sapiens = "[" + series[1]` if so, then there might be some different types of lines other than the ones you've provided. Please provide the data sample. I've also updated the answer with an o/p screenshot. – Mr. Hobo Nov 10 '21 at 07:26