I have a FASTA file that contains around 900k protein sequences. The first three records are shown below as an example:
>NP_000011.2 serine/threonine-protein kinase receptor R3 precursor [Homo sapiens]
MTLGSPRKGLLMLLMALVTQGDPVKPSRGPLVTCTCESPHCKGPTCRGAWCTVVLVREEGRHPQEHRGCGNLHRELCRGR
PTEFVNHYCCDSHLCNHNVSLVLEATQPPSEQPGTDGQLALILGPVLALLALVALGVLGLWHVRRRQEKQRGLHSELGES
>NP_000012.1 presenilin-1 isoform I-467 [Homo sapiens]
MTELPAPLSYFQNAQMSEDNHLSNTVRSQNDNRERQEHNDRRSLGHPEPLSNGRPQGNSRQVVEQDEEEDEELTLKYGAK
HVIMLFVPVTLCMVVVVATIKSVSFYTRKDGQLIYTPFTEDTETVGQRALHSILNAAIMISVIVVMTILLVVLYKYRCYK
>NP_000013.2 adenosine deaminase isoform 1 [Homo sapiens]
MAQTPAFDKPKVELHVHLDGSIKPETILYYGRRRGIALPANTAEGLLNVIGMDKPLTLPDFLAKFDYYMPAIAGCREAIK
RIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVEPIPWNQAEGDLTPDEVVALVGQGLQEGERDFGVKARSILCCMRHQPN
I want to load them into a DataFrame with proper column names, one row per record and the wrapped sequence lines joined together, looking like this:
ID name sapiens sequence
>NP_000011.2 serine/threonine-protein kinase [Homo sapiens] MTLGSPRKGLLMLLMALVTQGDPVKPSRGPLVTCTCESPHCKGPTCRGAWCTVVLVREEGRHPQEHRGCGNLHRELCRGRPTEFVNHYCCDSHLCNHNVSLVLEATQPPSEQPGTDGQLALILGPVLALLALVALGVLGLWHVRRRQEKQRGLHSELGES
>NP_000012.1 presenilin-1 isoform I-467 [Homo sapiens] MTELPAPLSYFQNAQMSEDNHLSNTVRSQNDNRERQEHNDRRSLGHPEPLSNGRPQGNSRQVVEQDEEEDEELTLKYGAKHVIMLFVPVTLCMVVVVATIKSVSFYTRKDGQLIYTPFTEDTETVGQRALHSILNAAIMISVIVVMTILLVVLYKYRCYK
>NP_000013.2 adenosine deaminase isoform 1 [Homo sapiens] MAQTPAFDKPKVELHVHLDGSIKPETILYYGRRRGIALPANTAEGLLNVIGMDKPLTLPDFLAKFDYYMPAIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVEPIPWNQAEGDLTPDEVVALVGQGLQEGERDFGVKARSILCCMRHQPN
None of the methods below worked:
import re
import pandas as pd

# Attempt 1: read all lines, strip them, then fill the DataFrame row by row
df = open('sample.faa', 'r')
lines = df.readlines()
df.close()
for index, line in enumerate(lines):
    lines[index] = line.strip()
df_result = pd.DataFrame(columns=('ID', 'name'))
i = 0
ID = ""
name = ""
for line in lines:
    if 'X' in line:
        ID = line.replace('X', "")
    else:
        name = re.sub(r']', "", line)
        df_result.loc[i] = [ID, name]
        i = i + 1
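# Attempt 2: pass the open file handle straight to DataFrame
f = open('sample.faa', encoding='utf8')
df = pd.DataFrame(f)
df
# Attempt 3: read the file as if it were comma-separated
data = pd.read_csv('sample.faa', sep=',')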
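For reference, here is a rough sketch of the kind of parsing I think is needed, in case it helps clarify the goal. A FASTA record is a header line starting with `>` followed by one or more wrapped sequence lines, so the lines have to be grouped per record rather than handled one by one. The names `parse_fasta`, `_make_record` and `HEADER_RE` are made up for illustration, and the regex assumes every header looks like the three samples above (`>ID name [species]`); other header layouts would need a different pattern.

```python
import re
from typing import Iterable, Iterator

# Assumed header layout, based only on the sample above: ">ID name [species]".
HEADER_RE = re.compile(r"^(>\S+)\s+(.*?)\s*(\[[^\]]*\])?\s*$")


def _make_record(header: str, chunks: list) -> dict:
    """Split one header into fields and join its wrapped sequence lines."""
    match = HEADER_RE.match(header)
    # Fall back to the whole header as the ID if the pattern does not match.
    seq_id, name, species = match.groups() if match else (header, "", None)
    return {
        "ID": seq_id,
        "name": name,
        "sapiens": species or "",
        "sequence": "".join(chunks),
    }


def parse_fasta(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per FASTA record from an iterable of text lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)  # close the previous record
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)  # sequence line belonging to the current header
    if header is not None:
        yield _make_record(header, chunks)  # emit the last record
```

With pandas installed, something like `with open('sample.faa') as handle: df = pd.DataFrame(list(parse_fasta(handle)))` should then give the four columns shown above. Because the parser reads the file line by line, only the final list of records has to fit in memory, which seems manageable for around 900k sequences, though I have not measured it.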