How to find and store the number of occurrences of substrings in strings into a Python dictionary?

Question

I have a problem: I don't know how to create a matrix.

I have a dictionary of this type:

dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

and a file like this:

sp_345_4567 pe_645_4567876  ap_456_45678    pe_645_4556789
sp_345_567  pe_645_45678
pe_645_45678    ap_456_345678
sp_345_56789    ap_456_345
pe_645_45678    ap_456_345678
sp_345_56789    ap_456_345
s45678  f45678  f456789 ap_456_52546135

What I want is a matrix that records, for each line, how many times each value from the dictionary appears, so that I can see where a value occurs more than n times.

This is how I want to proceed:

Step 1: create a dictionary that maps each line number to the values found on that line, like this:

dictionary = {'1': ['sp_345_4567', 'pe_645_4567876', 'ap_456_45678', 'pe_645_4556789'], '2': ['sp_345_567', 'pe_645_45678'], '3': ['pe_645_45678', 'ap_456_345678'], '4': etc ..

Then I want to compare these values with my first dictionary, dico, and count, for example, how many times the banana key appears in each line (and do the same for every key of dico). The problem is that the values of dico are not equal to those of dictionary, because in the file they are followed by the pattern '_\w+'.

The idea would be to build a final_dict that looks like this, so that I can make a matrix at the end:

final_dict = {'line1': {'Banana': 1, 'Apple': 1, 'Pear': 2}, 'line2': etc ...

Here is my code, which doesn't work:

import pprint
import re
import csv

dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

dictionary = {}
final_dict = {}
cnt = 0
with open("test.txt") as file :
    reader = csv.reader(file, delimiter ='\t')
    for li in reader:
        grp = li
        number = 1
        for li in reader:
            dictionary[number] = grp
            number += 1
            pprint.pprint(dictionary)
            number_fruit = {}
            for key1, val1 in dico.items():
                for key2, val2 in dictionary.items():
                     if val1 == val2+'_\w+':
                         final_dict[key1] = val2

Thanks for the help

EDIT :

I've tried using a dict comprehension

import csv
import re

dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

with open("test.txt") as file :
    reader = csv.reader(file, delimiter ='\t')
    for li in reader:
        pattern = re.search(dico["banana"]+"_\w+", str(li))
        if pattern:
            final_dict = {"line" + str(index + 1):{key:line.count(text) for key, text in dico.items()} for index, line in enumerate(reader)}
        print(final_dict)

But when I print my final dictionary, it only contains 0, even for banana:

{'line1': {'banana': 0, 'apple': 0, 'pear': 0}, 'line2': {'banana': 0, 'apple': 0, 'pear': 0}, 'line3': {'banana': 0, 'apple': 0, 'pear': 0}, 'line4': {'banana': 0, 'apple': 0, 'pear': 0}, 'line5': {'banana': 0, 'apple': 0, 'pear': 0}, 'line6': {'banana': 0, 'apple': 0, 'pear': 0}}

So yeah, now it looks a bit more like what I wanted, but the occurrence counts never go above 0 ... :/ Maybe my condition should be inside the dict comprehension?

Corentin Pane · Accepted Answer · 2019-11-05T10:45:45.893

1

Why it doesn't work

Your test

if val1 == val2+'_\w+':
    ...

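doesn't work because == tests plain string equality and never interprets '_\w+' as a regular expression. If val1 is "sp_345" and val2 is a string such as "sp_345_4567", then val2+'_\w+' is literally the string "sp_345_4567_\w+", and the two are not equal. Swapping the operands does not help either: "sp_345"+'_\w+' is literally "sp_345_\w+", which is still not equal to "sp_345_4567".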
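
To make the difference concrete, here is a minimal sketch comparing plain equality with an actual regular-expression match. The names prefix and token are made up for illustration; they stand for one value of dico and one field from the file.

```python
import re

prefix = "sp_345"       # hypothetical: a value taken from dico
token = "sp_345_4567"   # hypothetical: a field read from the file

# == compares the strings character by character; "_\w+" is just text here.
equal = token == prefix + r"_\w+"

# re.fullmatch interprets "_\w+" as a pattern and must match the whole token.
matched = re.fullmatch(re.escape(prefix) + r"_\w+", token) is not None

print(equal, matched)  # prints: False True
```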

What you could do about it

You might want to test for containment, for example

if val1 in val2:
    ...

You can check that "sp_345" in "sp_345_4567" evaluates to True.

You might also want to actually count the number of times "sp_345" appears in another string, and you can do this using .count:

"sp_345_567  pe_645_45678".count("sp_345") # returns 1
"sp_345_567  pe_645_45678".count("_") # returns 4
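
Applied to every entry of dico at once, the same .count call gives the per-fruit counts for one line. This is only a sketch: the variable line is a hard-coded copy of the first line of the sample file, and I am assuming the fields are separated by tabs.

```python
dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

# Assumption: first line of the sample file, with tab-separated fields.
line = "sp_345_4567\tpe_645_4567876\tap_456_45678\tpe_645_4556789"

# str.count returns the number of non-overlapping occurrences of the substring.
counts = {fruit: line.count(prefix) for fruit, prefix in dico.items()}

print(counts)  # prints {'banana': 1, 'apple': 1, 'pear': 2}
```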

You could also do it using regular expressions as you've tried to:

import re
pattern = "sp_345_" + "\\w+"

if re.match(pattern, "sp_345_4567"):
    # pattern was found! Do stuff here.
    pass

# alternatively:
print(re.findall(pattern, "sp_345_4567"))
# prints ['sp_345_4567']
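
One reason to prefer the regular expression over .count is precision: a plain substring count would also accept a field like "sp_3456_1", which only shares the first characters. The sketch below is one possible way to count whole tokens; count_tokens is a helper name I made up, and the extra "sp_3456_1" field is invented test data, not part of your file.

```python
import re

dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

# Sample line plus one invented field ("sp_3456_1") that must NOT count as banana.
line = "sp_345_4567 pe_645_4567876 ap_456_45678 pe_645_4556789 sp_3456_1"


def count_tokens(prefix, text):
    """Count tokens of the form <prefix>_<word characters> in text."""
    # \b anchors the match at a word boundary, re.escape protects any special
    # characters in the prefix, and \w+ requires a non-empty suffix.
    pattern = r"\b" + re.escape(prefix) + r"_\w+"
    return len(re.findall(pattern, text))


counts = {fruit: count_tokens(prefix, line) for fruit, prefix in dico.items()}

print(counts)                # prints {'banana': 1, 'apple': 1, 'pear': 2}
print(line.count("sp_345"))  # prints 2: the substring count is fooled by sp_3456_1
```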

How can you apply that to build your final_dict

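You can rewrite your code in a much simpler way using a dictionary comprehension:

import csv

dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

with open("test.txt") as file:
    reader = csv.reader(file, delimiter="\t")
    # csv.reader yields each row as a list of fields, so sum the substring
    # counts over the fields (list.count only counts exact matches and
    # would always return 0 here).
    final_dict = {
        "line" + str(index + 1): {
            key: sum(field.count(text) for field in row)
            for key, text in dico.items()
        }
        for index, row in enumerate(reader)
    }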

I'm building an outer dictionary with keys like "line1", "line2"... and for each of them, the value is an inner dictionary with keys like "banana" or "apple" and each value is the number of times they appear on the line.

If you want to know how many times the banana appears on line 4, you'd use

print(final_dict["line4"]["banana"])
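
Since the end goal is a matrix, here is a rough sketch of how a nested dictionary of this shape could be flattened into a list of rows. The rows variable is hard-coded sample data standing in for what csv.reader would yield, columns and matrix are names I chose for illustration, and the row order relies on dictionaries preserving insertion order (Python 3.7+).

```python
dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

# Assumption: three rows of already-split fields, copied from the sample file.
rows = [
    ["sp_345_4567", "pe_645_4567876", "ap_456_45678", "pe_645_4556789"],
    ["sp_345_567", "pe_645_45678"],
    ["s45678", "f45678", "f456789", "ap_456_52546135"],
]

# Same shape as final_dict: one inner dictionary of counts per line.
final_dict = {
    "line" + str(index + 1): {
        key: sum(field.count(text) for field in row)
        for key, text in dico.items()
    }
    for index, row in enumerate(rows)
}

# Fix the column order once, then emit one matrix row per line.
columns = list(dico)
matrix = [[counts[name] for name in columns] for counts in final_dict.values()]

print(columns)  # prints ['banana', 'apple', 'pear']
print(matrix)   # prints [[1, 1, 2], [1, 0, 1], [0, 1, 0]]
```

Each row of matrix corresponds to a line of the file and each column to a key of dico, so filtering for counts above some n is a plain comparison on the cells.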

Note that I would recommend using a list rather than a dictionary to map results to line numbers, so that the previous query would become:

print(final_list[3]["banana"])
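
For completeness, a sketch of how such a final_list could be built. To keep it runnable without a file on disk, I am feeding csv.reader from an io.StringIO that holds the first four lines of the sample, assuming tab separators; with the real file you would pass the open file object instead.

```python
import csv
import io

dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

# Assumption: first four lines of the sample file, tab-separated.
sample = (
    "sp_345_4567\tpe_645_4567876\tap_456_45678\tpe_645_4556789\n"
    "sp_345_567\tpe_645_45678\n"
    "pe_645_45678\tap_456_345678\n"
    "sp_345_56789\tap_456_345\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")

# One dictionary of counts per line; the list index is the line number minus 1.
final_list = [
    {key: sum(field.count(text) for field in row) for key, text in dico.items()}
    for row in reader
]

print(final_list[3]["banana"])  # prints 1 (line 4 contains sp_345_56789)
```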

edited Nov 05 '19 at 10:45

answered Nov 04 '19 at 21:51

Corentin Pane


Thanks that did help ! But that doesn't do the final dict I want :/ – BillyPocheo Nov 05 '19 at 06:35
No it indeed doesn't, but SO isn't a place where people write code for you:) I presume you have enough to get you going, if you think you don't you can still comment on a specific issue and I'll answer, or you can even ask another question about another specific problem you're facing. – Corentin Pane Nov 05 '19 at 09:18
Yeah I know :) Thx thats really nice, I just wanted to know if it is the best strategy in order to make the matrix from the dictionnary or is there something less complicated – BillyPocheo Nov 05 '19 at 10:25
okay, i edited my post to include the construction of `final_dict` as per your remarks. Hope that helps! – Corentin Pane Nov 05 '19 at 10:39
Thanks I'll try and I'll edit my post if there are problems ! :) – BillyPocheo Nov 05 '19 at 10:53
Okay, so I tried using a condition inside my dict comprehension like it is mentioned here https://stackoverflow.com/questions/9442724/how-to-use-if-else-in-a-dictionary-comprehension but didn't worked :') (nothing is printed) – BillyPocheo Nov 05 '19 at 21:00
Please ask another question with all needed details and I'll be happy to answer it:) – Corentin Pane Nov 05 '19 at 21:46
I meant you should ask a new question on SO since the topic has shifted quite a lot. – Corentin Pane Nov 05 '19 at 22:18
oh okay then but you helped me to make a good dictionnary so I'll put a V – BillyPocheo Nov 05 '19 at 22:20
