I need help to automatically decensor a text (lots of text to be processed)

Question

I have a web story in which some words are censored with asterisks.

Right now I'm doing it with a simple and dumb str.replace,

but as you can imagine this is a pain, and I have to search through the text to find every instance of the censoring.

Here are the instances of "bastard", which appear capitalized, plural, and with asterisks in different places:

toReplace = toReplace.replace("b*stard", "bastard")
toReplace = toReplace.replace("b*stards", "bastards")
toReplace = toReplace.replace("B*stard", "Bastard")
toReplace = toReplace.replace("B*stards", "Bastards")
toReplace = toReplace.replace("b*st*rd", "bastard")
toReplace = toReplace.replace("b*st*rds", "bastards")
toReplace = toReplace.replace("B*st*rd", "Bastard")
toReplace = toReplace.replace("B*st*rds", "Bastards")

Is there a way to compare every word containing "*" (or any other replacement character) against an already compiled dictionary and replace it with the uncensored version of the word? Maybe regex, but I don't think so.

https://docs.python.org/3/library/fnmatch.html#fnmatch.filter allows you to perform glob matching against a list of strings; if you pair this with generating a selective list from a sorted dictionary before evaluating your filter it should be effective as well. — MatsLindh, Nov 19 '22 at 17:33
Though this won't in itself solve the main problem, note that you can halve the pain by only doing the replace on singular words (since they're included in their plural form). — Swifty, Nov 19 '22 at 17:57

score 1 · Answer 1 · answered Nov 19 '22 at 17:47

Using regex alone will likely not result in a full solution for this. You would likely have an easier time if you have a simple list of the words that you want to restore, and use Levenshtein distance to determine which one is closest to a given word that you have found a * in.

One library that may help with this is fuzzywuzzy.

The two approaches that I can think of quickly:

Split the text so that you have 1 string per word. For each word, if '*' in word, then compare it to the list of replacements to find which is closest.
Use re.sub to identify the words that contain a * character, and write a function that you would use as the repl argument to determine which replacement it is closest to and return that replacement.
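
The second approach could be sketched roughly as follows. This is only an illustration under a few assumptions: the word list `WORDS` is hypothetical, the edit distance is a small hand-written Levenshtein function rather than fuzzywuzzy, and a candidate is accepted only when its distance does not exceed the number of asterisks (which holds when each asterisk hides exactly one letter).

```python
import re

# Hypothetical word list; in practice this would be your own list of words to restore.
WORDS = ["bastard", "bastards", "damn"]


def levenshtein(a, b):
    """Return the edit distance between a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def restore(match):
    """repl callback for re.sub: return the closest known word, or the text unchanged."""
    censored = match.group(0)
    lowered = censored.lower()
    best = min(WORDS, key=lambda w: levenshtein(lowered, w))
    # Assumption: one asterisk per hidden letter, so a correct candidate
    # is never further away than the number of asterisks.
    if levenshtein(lowered, best) > censored.count("*"):
        return censored
    # Carry the capitalization of the censored word over to the replacement.
    return best.capitalize() if censored[0].isupper() else best


def decensor(text):
    # Any run of word characters and asterisks containing at least one asterisk.
    return re.sub(r"[\w*]*\*[\w*]*", restore, text)
```

Calling `decensor(text)` on the whole story replaces every censored word it can resolve and leaves the rest untouched, so unresolved words can still be found afterwards by searching for `*`.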


Andrej Kesely · Answer 2 · 2022-11-19T18:22:27.147

You can use the re module to find matches between the censored word and the words in your wordlist.

Replace * with . (the dot has a special meaning in regex: it matches any single character) and then use re.match:

import re

wordlist = ["bastard", "apple", "orange"]


def find_matches(censored_word, wordlist):
    pat = re.compile(censored_word.replace("*", "."))
    return [w for w in wordlist if pat.match(w)]


print(find_matches("b*st*rd", wordlist))

Prints:

['bastard']

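Note: If you want to match the exact word only, add $ at the end of your pattern (or use re.fullmatch instead of re.match). That way appl* will not match applejuice in your dictionary, for example; without it, re.match only anchors the pattern at the start of the word.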
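
To apply the same idea to a whole text rather than a single word, something along these lines might work. It is a sketch, not a definitive implementation: `WORDLIST`, `CENSORED`, `uncensor_word` and `uncensor` are names made up for this example, plural forms are assumed to be listed explicitly, and the censored word is passed through re.escape first so that any other regex metacharacters in the text are treated literally.

```python
import re

# Hypothetical word list; plural and other forms are listed explicitly.
WORDLIST = ["bastard", "bastards", "apple", "applejuice"]

# Any run of word characters and asterisks that contains at least one asterisk.
CENSORED = re.compile(r"[\w*]*\*[\w*]*")


def uncensor_word(match):
    """repl callback: return the single matching word, or the text unchanged."""
    censored = match.group(0)
    # Escape everything, then turn each escaped asterisk into "any one character".
    pattern = re.compile(re.escape(censored.lower()).replace(r"\*", "."))
    # fullmatch requires the whole word to match, like adding $ to the pattern.
    hits = [w for w in WORDLIST if pattern.fullmatch(w)]
    if len(hits) != 1:
        return censored  # no match, or ambiguous: leave the text untouched
    word = hits[0]
    # Carry the capitalization of the censored word over to the replacement.
    return word.capitalize() if censored[0].isupper() else word


def uncensor(text):
    return CENSORED.sub(uncensor_word, text)
```

Leaving ambiguous or unknown words untouched is a deliberate choice here: they still contain an asterisk, so they are easy to find and fix by hand afterwards.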
