
I'm trying to count the occurrences of every pair of adjacent letters in strings that are stored as rows of a TSV file. The rows look like this:

0 ILDIGCGRGRHARALVRRGWQVTGLDLSEDAVAAARSRVADDDLDV...

1 AELETLQAKINPHFLYNSLNSIASLVYTDPEKAEKMVLMLSKLFRV...

2 AQLSSLKEQLNPHFLFNTFNTLYGISLKYPERVPDLIMHTSQLMRY...

3 TEIKALQSQIKPHFLFNTLNAIRCTIINNNNDKAADLVYKLAMLLR...

4 SEMSRLNAQINPHFLFNTLNFFYSEVRTLHPKISESILLLSDIMRY...

...

1000 SELSFLKAQINPHFFFNTLNNIYALTMMDVASAQEALHRLSRMMRY...

1001 ILEPGCGTGRLMLALAEHGHHVAGVDASATALEFCRERLTQHGLTG...

1002 IADLGAGEGTISQLMAQRAKRVIAIDNSEKMVEFGAELARKHGIAN...

1003 AELRALRAQISPHFIYNALAAIASFVRTDPERARELLLEFADFSRY...

1004 VVDLGCGSGASTDALVNSMGHRGETYAAIGIDASAGMLTEAHSKPW...

[1005 rows x 1 columns]

Then, for each row, I'd like to get the number of occurrences of AA, AB, AC, ..., ZY, ZZ. An example is below.

If a row contains the string "AEAETLQAKIN", I'd like to get the result below.

(The string must be stored in a variable named acid, e.g. acid='AEAETLQAKIN'.)

IN[]......(I'd like to know what code to write here to get the occurrences.)

OUT[] AA: 0, AC: 0, AD: 0, AE: 2, ... AK: 1, ... EA: 1, ...

michel
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Mar 19 '22 at 10:23

1 Answer


If you want a dict containing only the pairs that actually occur, use a defaultdict

from collections import defaultdict
from string import ascii_uppercase

def occurrences(content):
    result = defaultdict(int)
    for i in range(len(content) - 1):
        result[content[i:i + 2]] += 1
    return result
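
As a quick check against the example in the question, the same sliding-window count can also be sketched with collections.Counter. This is an alternative, not the code above, and pair_counts is a hypothetical name chosen for illustration. A Counter returns 0 for pairs that never occur, without storing them.

```python
from collections import Counter

def pair_counts(content):
    # Pair each character with its successor, so the pairs overlap:
    # "AEA" yields "AE" and "EA".
    return Counter(a + b for a, b in zip(content, content[1:]))

acid = "AEAETLQAKIN"  # example string from the question
counts = pair_counts(acid)
print(counts["AE"], counts["EA"], counts["AK"], counts["AA"])  # 2 1 1 0
```

The printed values match the AE: 2, EA: 1, AK: 1, AA: 0 expected in the question.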

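If you also want the pairs with a count of 0, i.e. all 26x26=676 pairs, prepare one dict beforehand. Note that this version raises a KeyError if a string contains any character outside A-Z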

from itertools import product
from string import ascii_uppercase

OCCURRENCE_DEFAULT = {f"{x}{y}": 0 for x, y in product(ascii_uppercase, repeat=2)}

def occurrences(content):
    result = OCCURRENCE_DEFAULT.copy()
    for i in range(len(content) - 1):
        result[content[i:i + 2]] += 1
    return result

Then apply it to each string in your data

value = ["0 ILDIGCGRGRHARALVRRGWQVTGLDLSEDAVAAARSRVADDDLDV",
         "4 SEMSRLNAQINPHFLFNTLNFFYSEVRTLHPKISESILLLSDIMRY"]
for row in value:
    # Drop the leading row index and keep only the sequence.
    occ = occurrences(row.split()[1])
    print(occ)
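
To run this over every row of a TSV file, one possible sketch uses the standard csv module. It assumes each line holds a row label and a sequence separated by a tab, with the sequence in the last column; the real file layout is not shown in the question, so adjust as needed. The name occurrences_per_row and the in-memory sample are hypothetical, and the sample stands in for an actual file handle.

```python
import csv
import io
from collections import Counter

def occurrences_per_row(lines):
    """Yield (row label, pair counts) for each tab-separated line."""
    for fields in csv.reader(lines, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        # Assumption: label in the first column, sequence in the last.
        label, sequence = fields[0], fields[-1]
        yield label, Counter(a + b for a, b in zip(sequence, sequence[1:]))

# Stand-in for the real file. With a file on disk, pass an open handle
# instead, e.g. open("your_file.tsv", newline="") (hypothetical path).
sample = io.StringIO("0\tILDIGCGRGRHA\n1\tAELETLQAKINPHF\n")
results = dict(occurrences_per_row(sample))
print(results["0"]["GR"], results["1"]["AE"])  # 2 1
```

If the file has only one column, the label and the sequence are the same field, so the labels would need to come from enumerate instead.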
azro
  • Thank you for your kind answer. I have an additional question. Is there a way to run the processing from your answer in a loop over each of the strings 0~1004 in the list? I have a TSV file named 'df_tsv_2' that contains the strings listed in this question. – michel Mar 19 '22 at 10:53
  • @michel so you're asking how to read the file and apply the logic to each line? – azro Mar 19 '22 at 11:44
  • Yes, I am. Could you give me some ideas if you don't mind? I did some research in books and on the internet, but I've just started studying ML with Python and haven't been able to work out how to apply them. – michel Mar 19 '22 at 12:07
  • @michel you didn't find anything? That is basic, so I'll let you search again; here is some help: https://stackoverflow.com/questions/53283718/reading-a-file-line-by-line-in-python – azro Mar 19 '22 at 18:18