How to replace characters in a text file with items from a dictionary? (ASCII characters to Unicode)

Question

I have created a function that is supposed to read a text file and replace a lot of ASCII characters with their equivalents in Unicode. The problem is that the function does not replace any characters in the string unless I remove all but one of the items from the dictionary. I have experimented the whole day but cannot find the solution.

Here is the function:

import re

match = {
    # the original dictionary contains over 100 items
    "᾿Ι" : "Ἰ",
    "᾿Α" : "Ἀ",
    "´Α" : "Ά",
    "`Α" : "Ὰ",
    "᾿Α" : "Ἀ",
    "᾿Ρ" : "ῤ",
    "῾Ρ" : "Ῥ"
}

with open("file.txt", "r", encoding="utf-8") as file, open("OUT.txt", "w", encoding="utf-8") as newfile:
    def replace_all(text, dict):
        for i, j in dict.items():
            result, count = re.subn(r"%s" % i, j, str(text))
        return result, count
    
    # start the function
    string = file.read()
    result, count = replace_all(string, match)
    
    # write out the result
    newfile.write(result)
    print("Changes: " + str(count))

The text file contains a lot of rows similar to the one below:

Βίβλος γενέσεως ᾿Ιησοῦ Χριστοῦ, υἱοῦ Δαυῒδ, υἱοῦ ᾿Αβραάμ.

Here the characters "᾿Ι" and "᾿Α" are supposed to be replaced with "Ἰ" and "Ἀ".

The replacement is fine. The problem is that you're not saving the result except the last one. You might want to learn [how to step through Python code](/q/4929251/4518341) to help identify problems in the future. — wjandrea, Aug 26 '23 at 19:43
Thank you, that is very helpful. Just for clarification, do you mean that the function iterates over the text file for every dictionary item and that each iteration needs to be saved? — Lavonen, Aug 26 '23 at 19:56
Oh wait a minute, there's a better way to do this. See [How can I do multiple substitutions using regex?](/q/15175142/4518341) — wjandrea, Aug 26 '23 at 21:13
@Lavonen Not each iteration, the cumulative changes. Like, first we replace key0 with value0 and save the result, then take the result and replace key1 with value1, etc. And initially, `result = text`. — wjandrea, Aug 26 '23 at 21:25
BTW, two things you could improve: `print("Changes:", count)` and avoid [shadowing](//en.wikipedia.org/wiki/Variable_shadowing) the [builtin `dict` type](//docs.python.org/3/library/stdtypes.html#list). You could call the parameter `d` if you want to be concise or `replacements` if you want to be descriptive (which I'd recommend). — wjandrea, Aug 26 '23 at 21:35

Jesse Sealand · Answer 1 · 2023-08-26T22:09:21.517

1

Update the string after each iteration over the dictionary, so that every replacement is applied to the result of the previous one. For example:

    def replace_all(text, dict):
        updated_string = text
        total_changes = 0
        for i, j in dict.items():
            updated_string, count = re.subn(r"%s" % i, j, updated_string)
            total_changes += count
        return updated_string, total_changes

This will ensure that any previous replacements are carried forward.
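As a quick check, here is a minimal sketch of the same idea run against the sample line from the question. The mapping is trimmed to two entries, the parameter is named `replacements` instead of `dict`, and `re.escape` is an added precaution (not part of the answer above) so that each key is matched literally even if it happens to contain a regex metacharacter.

```python
import re

def replace_all(text, replacements):
    """Apply every replacement in turn, carrying the result forward."""
    updated = text
    total = 0
    for old, new in replacements.items():
        # re.escape is an added precaution: it makes the key match literally.
        updated, count = re.subn(re.escape(old), new, updated)
        total += count
    return updated, total

# Sample line taken from the question.
sample = "Βίβλος γενέσεως ᾿Ιησοῦ Χριστοῦ, υἱοῦ Δαυῒδ, υἱοῦ ᾿Αβραάμ."
result, changes = replace_all(sample, {"᾿Ι": "Ἰ", "᾿Α": "Ἀ"})
print(result)
print("Changes:", changes)
```

Because `updated` is fed back into `re.subn` on every pass, both keys end up replaced in the returned string, and `total` accumulates the per-key counts.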

edited Aug 26 '23 at 22:09

answered Aug 26 '23 at 20:51

Jesse Sealand


Right approach! but wrong implementation. `updated_string = string.copy()` should be `updated_string = text` since strings don't have a `.copy()` method and `string` is outside the function's scope. – wjandrea Aug 26 '23 at 21:29
BTW, `str(updated_string)` is unnecessary. You could just do `updated_string = str(text)` and leave it at that, although I'm not sure why OP's converting `text` to string in the first place. – wjandrea Aug 26 '23 at 21:31
BTW, welcome back to Stack Overflow! You can check out [answer] if you want tips. One thing not mentioned there is it's a good idea to test your code before posting it. – wjandrea Aug 26 '23 at 21:37
Corrected changes in my response. Good catch on the variable name and redundant str() call. – Jesse Sealand Aug 26 '23 at 22:09

score 0 · Accepted Answer · answered Aug 26 '23 at 22:32

You're assigning the result of each replacement to the same `result` variable while always starting again from the original `text`, so only the last replacement is kept. `count` is likewise overwritten on every iteration instead of accumulating the total number of replacements.

import re

match = {
    "᾿Ι" : "Ἰ",
    "᾿Α" : "Ἀ",
    "´Α" : "Ά",
    "`Α" : "Ὰ",
    "᾿Α" : "Ἀ",
    "᾿Ρ" : "ῤ",
    "῾Ρ" : "Ῥ"
}

def replace_all(text, dict):
    result = text
    count = 0
    for i, j in dict.items():
        result, num_replacements = re.subn(i, j, result)
        count += num_replacements
    return result, count

with open("file.txt", "r", encoding="utf-8") as file, open("OUT.txt", "w", encoding="utf-8") as newfile:
    
    string = file.read()
    result, count = replace_all(string, match)
    
    newfile.write(result)
    print("Changes: " + str(count))

This version carries `result` forward through the loop, so every key in the `match` dictionary is replaced, and it adds the count returned by each `re.subn` call to `count` to get the total number of replacements.
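For a dictionary with over 100 entries, a single-pass variant may also be worth considering, along the lines of the "multiple substitutions using regex" question linked in the comments. The following is only a sketch under stated assumptions: the function name `replace_all_single_pass` is hypothetical, keys are treated as literal text via `re.escape`, and longer keys are tried first. Unlike the loop above, the text is scanned once, so the output of one replacement is never matched again by another key.

```python
import re

def replace_all_single_pass(text, replacements):
    """Replace every key in one scan of the text; returns (new_text, count)."""
    if not replacements:
        # An empty alternation would match everywhere, so bail out early.
        return text, 0
    # Longest keys first, so a longer key wins over a shorter one it starts with.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # The callable looks up the replacement for whichever key matched.
    return pattern.subn(lambda m: replacements[m.group(0)], text)

# Sample line taken from the question.
sample = "Βίβλος γενέσεως ᾿Ιησοῦ Χριστοῦ, υἱοῦ Δαυῒδ, υἱοῦ ᾿Αβραάμ."
result, changes = replace_all_single_pass(sample, {"᾿Ι": "Ἰ", "᾿Α": "Ἀ"})
print(result)
print("Changes:", changes)
```

Whether this is preferable depends on the data: if some replacements are meant to feed into later ones, the cumulative loop is the one that gives that behaviour.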

Thank you very much for the spoon-feeding. The problem turned out to be much more complicated than I initially thought. I will study your example diligently. — Lavonen, Aug 27 '23 at 10:28
