struggling with a file exercise

Question

I am a beginner in Python trying to solve a file exercise. The exercise says: Write a function that takes the name of a file (a text file that contains lines of words) and returns a dictionary of the consecutive characters (if present) in each line.

Each line has to be taken as a single word. In other words, the spaces that separate the characters in a line must be ignored.

The keys of the dictionary are the repeated characters, and the values are the number of times each character is repeated in the file.

For example: For the following words present in the text file

casa a amalfi
azione estremizzata
ripasso organizzato

the dictionary must return the following keys and values:

{'a':1, 'e':1, 'z':2, 's':1, 'o':1}

With the code I've written I manage to get these desired values. However, the dictionary also contains some keys and values which shouldn't be present. I only want the repeated characters and the number of times they are repeated in each line.

In an attempt to solve this issue, I tried deleting the items whose values are equal to zero using a for loop. But it doesn't work. Instead, I get a runtime error that says: RuntimeError: dictionary changed size during iteration

Here is my code

def conta_lettere (filename) : 
    
    dizionario = {}
    prev_char = None
    flag = 0
    with open(filename) as f:
        for riga in f:
            riga = ''.join(riga.split())
            for parola in riga:
                for lettera in parola:
                    if lettera not in dizionario:
                        dizionario[lettera] = 0
                if lettera == prev_char and flag !=0:
                    dizionario[lettera] +=1
                    flag = 0
                else:
                    flag = 1
                prev_char = lettera
        for chiave,valore in dizionario.items():
            if valore == 0:
                del dizionario[chiave] 
    return dizionario

Any help will be appreciated

This is the output i get:

{'c': 0,
 'a': 1,
 's': 1,
 'm': 0,
 'l': 0,
 'f': 0,
 'i': 0,
 'z': 2,
 'o': 1,
 'n': 0,
 'e': 1,
 't': 0,
 'r': 0,
 'p': 0,
 'g': 0}

do you mean you want to count consecutive letters frequencies in each line? — , Aug 12 '20 at 12:06
I kind of don't understand the desired dictionary, i.e. why 'a' has value 1 when it is more than once in each line? — Ruli, Aug 12 '20 at 12:10
@Ruli: because it's not a letter frequency. Ignoring spaces, the first line reads "casaaamalfi". There's only one run of "a" of 2+ chars in all three lines. — Sergio Tulentsev, Aug 12 '20 at 12:11
Because if you ignore the spaces, 'aa', the letter frequency is only present in the first line — alex108, Aug 12 '20 at 12:12
do you want to simply count the characters per line? or count them by document? or do you want to count the maximum number of times a character is repeated? like "z" which is repeated 2x in "estremizzata" — Andreas, Aug 12 '20 at 12:12
You can either: 1) not initialize the missing values to 0 and instead insert 1 directly 2) when deleting entries with 0 values, do it by building a new dict, copying only k-v pairs with non-zero values — Sergio Tulentsev, Aug 12 '20 at 12:13
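
The two options in the last comment could be sketched roughly as follows. This is only an illustration, not the asker's code: the helper names drop_zero_counts and count_run are made up for the example.

```python
def drop_zero_counts(counts):
    """Option 2: build a new dict, copying only the non-zero entries."""
    return {key: value for key, value in counts.items() if value != 0}


def count_run(counts, letter):
    """Option 1: never store a 0; create the key on the first run seen."""
    counts[letter] = counts.get(letter, 0) + 1


raw = {'c': 0, 'a': 1, 's': 1, 'm': 0, 'z': 2}
print(drop_zero_counts(raw))  # {'a': 1, 's': 1, 'z': 2}

seen = {}
count_run(seen, 'z')
count_run(seen, 'z')
print(seen)  # {'z': 2}
```

Either way, the dictionary is never modified while it is being iterated, so the RuntimeError cannot occur.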

Nikith Clan · Answer 1 · 2020-08-12T12:37:39.367

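Since you want to read the file line by line, I suggest you use f.readlines(). It returns a list containing the lines of the file (f.readline() returns only a single line). Iterating over the file object, as you already do, also yields one line at a time.

If you want to remove spaces from a string, splitting it into a list and joining it back is a roundabout method. You can use the string replace method: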

riga = riga.replace(" ", "")

This will remove all space characters (but not tabs or the trailing newline).

To check whether consecutive letters are the same, loop over the indices from 0 to the last index minus 1:

for i in range(0, len(line) - 1):
    if line[i] == line[i+1]:
        ...  # line[i] and line[i+1] are the same letter

dict.keys() gives you a view of all the keys in a dictionary, but a plain membership test is enough: with a dictionary named d, the condition "if letter in d:" tells you whether a letter is already a key, so you can decide whether to insert it or increment its counter. This way you won't add any unnecessary letters as keys to your dictionary.
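
Put together, the index-based idea might look like the sketch below. One detail is my own assumption rather than part of the answer: "casa a amalfi" becomes "casaaamalfi", which has three a's in a row, so a plain pair comparison would count 'a' twice. The extra starts_run check counts each run only once, which is what the expected output in the question needs. The function name count_runs_in_line is hypothetical.

```python
def count_runs_in_line(line, counts):
    """Add 1 to counts[ch] for every run of 2+ equal characters in line."""
    line = line.replace(" ", "").rstrip("\n")
    for i in range(len(line) - 1):
        # Only the first pair of a run may count (assumption, see lead-in).
        starts_run = i == 0 or line[i - 1] != line[i]
        if starts_run and line[i] == line[i + 1]:
            if line[i] in counts:
                counts[line[i]] += 1
            else:
                counts[line[i]] = 1
    return counts


counts = {}
for line in ["casa a amalfi\n", "azione estremizzata\n", "ripasso organizzato\n"]:
    count_runs_in_line(line, counts)
print(counts)  # {'a': 1, 'e': 1, 'z': 2, 's': 1, 'o': 1}
```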

score 0 · Answer 2 · answered Aug 12 '20 at 12:41

Try this:

def returner(file):
    dic = {}
    with open(file) as f:
        lines = f.read().split('\n')
        for line in lines:
            line = line.replace(' ', '')
            count = 1
            for i in range(1, len(line)):
                if line[i-1] == line[i]:
                    count += 1
                else:
                    if count > 1:
                        # a run of 2+ equal characters has just ended
                        dic[line[i-1]] = dic.get(line[i-1], 0) + 1
                    count = 1
            # a run that reaches the end of the line is closed here
            if count > 1:
                dic[line[-1]] = dic.get(line[-1], 0) + 1
    return dic

returner('path/to/the/file')

it returns {'a': 1, 'e': 1, 'z': 2, 's': 1, 'o': 1} with your specified file — omdo, Aug 12 '20 at 12:43
You can avoid replacing spaces with nothing, just skip the spaces as a first thing in the loop. — Sergio Tulentsev, Aug 12 '20 at 12:47

score 0 · Answer 3 · edited Aug 23 '20 at 21:11

Proudly solved it :-)

from itertools import groupby

s = "zioonne  estreemizzataa"

groups = groupby(s)
result = [(label, sum(1 for _ in group)) for label, group in groups]
z = dict(result)
print(z) # check first success

delete = []
for key, val in z.items():
    if key == " " or val == 1:
        delete.append(key)
for i in delete:
    del z[i]
print(z) # check final success

output

{'z': 2, 'i': 1, 'a': 2, 'o': 2, 'n': 2, 'e': 2, ' ': 2, 's': 1, 't': 1, 'r': 1, 'm': 1}
{'z': 2, 'a': 2, 'o': 2, 'n': 2, 'e': 2}

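The last dict is cleaned so that it contains only the letters that occur consecutively, and the entry for spaces is dropped even if its count is greater than 1. Note that dict(result) keeps only the last run length seen for each letter, so the values are run lengths, not the number of runs.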
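
If the goal is the number of runs per letter, as in the question's expected output, the same groupby idea can be combined with collections.Counter. This is a sketch under that assumption. The function name count_runs is made up, and unlike the snippet above it removes the spaces before grouping, as the exercise asks.

```python
from collections import Counter
from itertools import groupby


def count_runs(lines):
    """Count, per character, the runs of 2+ equal characters in each line."""
    runs = Counter()
    for line in lines:
        letters = "".join(line.split())  # drop spaces and the newline
        for char, group in groupby(letters):
            if sum(1 for _ in group) > 1:
                runs[char] += 1
    return dict(runs)


print(count_runs(["casa a amalfi", "azione estremizzata", "ripasso organizzato"]))
# {'a': 1, 'e': 1, 'z': 2, 's': 1, 'o': 1}
```

Because an open text file is an iterable of lines, count_runs(f) should also work inside a "with open(filename) as f:" block.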

score 0 · Answer 4 · answered Aug 13 '20 at 21:59

The way to solve a larger problem is to break it down into smaller problems, and then solve each of them in-turn (possibly by again breaking...). In this case: 1/ read the file, 2/ prepare the data for analysis, 3/ analyse the data, 4/ report the results. These represent a common data-science sequence.

1/ There are two methods for reading a file. Yes, it could be read line by line (as suggested elsewhere), but given that the quantity of data is small, why not use one command to read the entire file into a single string?

Take a look at this string. Apart from the letters, there are spaces and one or two other characters that mark the ends of lines. NB These vary by operating system! You don't need to handle them specially here, but you do need to understand the concept.

Clarification: because of the question's wording ("lines"), I am assuming that if a line ends with a letter that is the same as the first letter on the successive line, such does NOT count!

2/ We need to "clean" the data by removing the spaces. Are you aware of the "empty string" (sometimes called the "null string")? There is a Python string method which replaces one sub-string with another. Replace the spaces with 'nothing' and then we have "casaa..." and thus our first 'match'. There is no need to worry about the line-endings - they won't match any letters, and they only match each other if the file contains an empty line. Leave them in place: removing them would join the last letter of one line to the first letter of the next.
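
A minimal sketch of steps 1/ and 2/, assuming the whole file fits comfortably in memory. The function name load_clean_text is hypothetical.

```python
def load_clean_text(filename):
    """Read the whole file into one string and remove the spaces."""
    with open(filename) as f:
        text = f.read()
    # Replace every space with the empty string; line endings stay in place.
    return text.replace(" ", "")


# The same cleaning step shown on a literal string:
sample = "casa a amalfi\nazione estremizzata\n"
print(repr(sample.replace(" ", "")))  # 'casaaamalfi\nazioneestremizzata\n'
```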

3/ To analyse the data, please imagine doing it on-paper (or a whiteboard - a great code-design tool!). Write the characters in a column. Now, the problem appears to be comparing 'this character' with the one below it. However, this gives rise to the complication - what to do at 'the bottom' (where there is no 'next character')?

Instead, create a second column of characters to the right of the first BUT put the second input-character at the top of this second column, follow it with all the others, and add the 'first-character' at the bottom. ("and the first shall be last"!). Now, the problem can be visualised as checking 'across': is 'this' character in the left column the same as the corresponding character in the right column?

When it comes to doing this in Python, you could use two lists; but equally, you could elect to remain with strings (the input 'arrives' as a string, so is changing to a list of characters 'extra work'?)

Having two strings (or lists) to process, most find it necessary to make Python's for-loop work like some-other-language's for-loop. Don't do this: Python's is a "for each" loop designed to access each member of a collection in-turn, whereas others' for-loops are designed to provide "pointers" or "counters" which is a marsh/bog of opportunity for error.

However, the need here is to process TWO collections (a string is a collection of characters!) at the same time. Python offers a function which allows us to zip two strings/lists/tuples/... together as if they were one entity - but organised pair-wise (cf "concatenation"). Sound familiar? This result (actually, a mechanism) can then be passed to the for(each)-loop.

All you have to do (sounds so easy when someone else says it!) is to compare the 'left-character' with the 'right-character', and if they match, count them using a dictionary.
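
The two-column comparison could be sketched like this. The function name matching_pairs is made up, and the input is assumed to be the cleaned text with its line endings still in place.

```python
def matching_pairs(text):
    """Return the left-hand character of every pair of equal neighbours."""
    # Second column: the same text shifted up by one, first character last.
    shifted = text[1:] + text[:1]
    return [left for left, right in zip(text, shifted) if left == right]


text = "casaaamalfi\nazioneestremizzata\nripassoorganizzato\n"
print(matching_pairs(text))  # ['a', 'a', 'e', 'z', 's', 'o', 'z']
```

Two things to watch for. First, 'a' shows up twice because "casaaamalfi" has three a's in a row, which makes two matching pairs, whereas the question's expected output counts that run once. Second, moving the first character to the bottom compares the last character with the first one. Since zip stops at the shorter input, zip(text, text[1:]) avoids that wrap-around comparison altogether.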

There's a(nother) problem here: the easiest way to 'count' is to use "+= 1", except that it assumes a zero-value the first time we count a letter. There are solutions, eg collections.defaultdict, but you might also review the dictionary method which gets a value if the dictionary-key (this letter) already exists, or returns a default value if it does not (when counting, zero).

In this way, you won't have a larger dictionary than necessary, full of zero-counts - which you would then have to remove/edit-out in the next step.
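
Both counting techniques mentioned here, dict.get with a default and collections.defaultdict, can be sketched as follows. The list of matches is made-up sample data.

```python
from collections import defaultdict

matches = ['a', 'e', 'z', 's', 'o', 'z']  # hypothetical sample data

# dict.get returns the default (0) the first time a letter is counted.
counts = {}
for letter in matches:
    counts[letter] = counts.get(letter, 0) + 1

# defaultdict(int) creates missing keys with the value 0 automatically.
auto_counts = defaultdict(int)
for letter in matches:
    auto_counts[letter] += 1

print(counts)             # {'a': 1, 'e': 1, 'z': 2, 's': 1, 'o': 1}
print(dict(auto_counts))  # same contents
```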

4/ Reporting the results is a matter of looping through the dictionary of counters and reporting the frequency of character-multiples.
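
One plain way to report, as a sketch only. The function name format_report and the wording of the output are arbitrary choices.

```python
def format_report(counts):
    """Return one report line per letter, in alphabetical order."""
    return [f"{letter}: doubled {times} time(s)"
            for letter, times in sorted(counts.items())]


for line in format_report({'a': 1, 'e': 1, 'z': 2, 's': 1, 'o': 1}):
    print(line)
```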

Given that this is obviously a student-assignment, you won't learn if I give you the answer as code. However, the 'key words' (above) should be apparent - you could/should look-up any Python commands you wish, for yourself (https://docs.python.org/3/index.html). Similarly, any ComSc terms you need to make familiar. Remember that if you open the Python interactive shell, or a REPL, you will be able to quickly experiment with 'new' constructs and ideas!

Thus, counting lines of code (LoC) from my own experiment/proof: 1/ 2 lines 2/ 2 lines 3/ 3 lines as for-loop 4/ 1 or 2 or... lines, depending upon how 'fancy' you'd like the output!

Programmers progress by asking one simple question (which in my case is likely born of an apparent 'laziness'): "surely there's an easier way to do this?". Look at the built-in functions provided by Python, and make use of its power (balanced by ensuring that your code is readable), rather than trying to make it look like C, Java, ... - or per the 'life advice' "listen (read the manuals) before/more than you talk (write the code)"...

score -1 · Answer 5 · answered Aug 12 '20 at 12:13

replace return dizionario with:

for key, val in dizionario.items():
    if val == 0:
        del dizionario[key]
return dizionario

Let me know if this works.

Adarsh TS

Why do you think it should work? It's not materially different, looks like? – Sergio Tulentsev Aug 12 '20 at 12:14
@SergioTulentsev ```dictionary changed size during iteration``` probably means the dict is iterated using either only ```d.keys()``` or ```d.values()```. By doing so it throws this error when an item is deleted/added since ```dict``` is unordered. – Adarsh TS Aug 13 '20 at 00:45
Not sure what you mean by all that, but the OP has this exact loop in their code and it throws that error. – Sergio Tulentsev Aug 13 '20 at 06:50
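
For reference, the error comes from deleting entries while the loop is still iterating over the same dictionary; it has nothing to do with ordering. A small sketch with made-up data shows the failure, then one common workaround: iterating over a list() snapshot of the items.

```python
counts = {'c': 0, 'a': 1, 's': 1, 'm': 0}

message = None
try:
    for key, value in counts.items():
        if value == 0:
            del counts[key]  # mutates the dict being iterated
except RuntimeError as error:
    message = str(error)
print(message)  # dictionary changed size during iteration

counts = {'c': 0, 'a': 1, 's': 1, 'm': 0}
for key, value in list(counts.items()):  # iterate over a snapshot
    if value == 0:
        del counts[key]
print(counts)  # {'a': 1, 's': 1}
```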

score -1 · Answer 6 · answered Aug 12 '20 at 12:22

I understood your example to mean that you want the maximum number of consecutive repetitions of each character per line, ignoring whitespace. You can do this by tracking the length of the current run: increase it by 1 if the character is the same as the previous character, otherwise reset it to 1, and keep the largest value seen for each character. This way you only need to go through the string once.

def count_max_repetitions(string):
    clean_string = "".join(string.split())
    dict_max_repetition = {x:1 for x in set(clean_string)}
    previous = ""
    run = 0
    for c in clean_string:
        # extend the current run, or start a new one when the character changes
        run = run + 1 if c == previous else 1
        dict_max_repetition[c] = max(dict_max_repetition[c], run)
        previous = c
    return dict_max_repetition

string = "casa a amalfi"
count_max_repetitions(string)
#Out[27]: {'a': 3, 'm': 1, 'i': 1, 'l': 1, 's': 1, 'c': 1, 'f': 1}

Additional examples:

string = "azione estremizzata"
count_max_repetitions(string)
# Out[28]: 
# {'t': 1,
#  'a': 1,
#  'r': 1,
#  'm': 1,
#  'n': 1,
#  'i': 1,
#  's': 1,
#  'z': 2,
#  'o': 1,
#  'e': 2}

string = "ripasso organizzato"
count_max_repetitions(string)
# Out[29]: 
# {'p': 1,
#  't': 1,
#  'a': 1,
#  'r': 1,
#  'i': 1,
#  'n': 1,
#  's': 2,
#  'g': 1,
#  'z': 2,
#  'o': 2}

Andreas

Your return values clearly disagree with what is specified in the question (which, presumably, comes from the task description given by their professor or education website) – Sergio Tulentsev Aug 12 '20 at 12:24
That is why I asked in the comments what he means, because the example is different from what he asked. The example doesn't fit his question. I can now either try to answer the question or recreate the example. I went with the question. – Andreas Aug 12 '20 at 12:34
Agree that it's worded a bit confusingly. In cases like this, I assume bad English skills either on my side or OP's and try to reinterpret the words through the prism of code. Words are wind, code doesn't lie. In this case, what they apparently mean is "if there are contiguous runs of the same character, how many runs are there per character, across the whole file?" – Sergio Tulentsev Aug 12 '20 at 12:53
Does this help you? https://stackoverflow.com/questions/34443946/count-consecutive-characters – Aug 12 '20 at 14:34
