Why are characters missing from some words in a string but not from others?

Question

`email_two` is a global string of a few paragraphs that includes the words 'researchers' and 'herself'. I have to censor the words in `email_two` that appear in the `proprietary_terms` list (it is passed into the function as `term`). However, when I used

email_two_new = email_two.split()

for item in email_two_new:
    for i in range(len(term)):
      if item in term[i]:

it removed 'her' from inside both 'researchers' and 'herself'. 'researchers' shouldn't be censored at all, and 'herself' should be censored completely because it is on the list. I checked that 'researchers' is not in 'her', so it shouldn't be touched, and each `item` prints as a whole word rather than as the individual characters of a word, so I don't know what went wrong.

proprietary_terms = ["she", "personality matrix", "sense of self", "self-preservation", "learning algorithm", "her", "herself"]
def censor_email_two(term):
  result = email_two
  email_two_new = email_two.split()

  for item in email_two_new:
    for i in range(len(term)):
      if item in term[i]:
        result = ''.join(result.split(term[i]))
      else:
        continue
  return result

What is the intended effect of calling `email_two.split()`. With no argument does it not simply make `email_two` an element in a list? Analogous to doing `[email_two]`? — PyPingu, Jul 22 '19 at 09:53
You are probably better off looking for each term using a regex, with word boundaries. [Docs here](https://docs.python.org/3.7/library/re.html#regular-expression-syntax) — PyPingu, Jul 22 '19 at 09:56
@PyPingu I want to try to get rid of censoring 'researchers' since 'researchers' cannot be in 'her' and it is censored if I match `term[i]` with the whole string of `email_two`, aka to make the `if item in term[i]:` work. — Sharon Ng, Jul 22 '19 at 10:18
@PyPingu but I cannot use regex with a variable like term? The term is supposed to change every time... — Sharon Ng, Jul 22 '19 at 10:19
Also you are returning `result` which will just be the unedited `email_two` — PyPingu, Jul 22 '19 at 10:38
@PyPingu here's the paragraph before the problematic censoring: ```Not only that, but we have configured her personality matrix to allow for communication between the system and our team of researchers. That's how we know she considers herself to be a she! We asked! ``` That's how it prints. — Sharon Ng, Jul 22 '19 at 10:38
@PyPingu I wanted the program to censor the term[i] for each loop, so `result` in the last term is supposed to return the edited email_two with all words censored from the list (`term`)? — Sharon Ng, Jul 22 '19 at 10:42

PyPingu · Answer 1 · 2019-07-22T14:27:08.747

So I think that this is best done using regex.

import re

proprietary_terms = [
    "she", "personality matrix", "sense of self",
    "self-preservation", "learning algorithm", "her", "herself"
]

def censor_email_two(email_string, terms, rep_str):
    subbed_str = email_string
    for t in terms:
        # \b on both sides means the term only matches as a whole word
        pat = r'\b%s\b' % t
        subbed_str = re.sub(pat, rep_str, subbed_str)
    # Run a split and join to remove double spaces created by the re.sub
    return ' '.join(subbed_str.split())

estr = "Not only that, but we have configured her personality matrix to allow for communication between the system and our team of researchers. That's how we know she considers herself to be a she! We asked!"

censor_email_two(estr, proprietary_terms, '')

Resulting string:

"Not only that, but we have configured to allow for communication between the system and our team of researchers. That's how we know considers to be a ! We asked!"

You can use the rep_str parameter to more easily see where there has been censoring:

censor_email_two(estr, proprietary_terms, "CENSORED")

"Not only that, but we have configured CENSORED CENSORED to allow for communication between the system and our team of researchers. That's how we know CENSORED considers CENSORED to be a CENSORED! We asked!"
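None of the terms in this list contain regex metacharacters, but if one ever did, it would probably need escaping before being dropped into the pattern. A minimal sketch, assuming a hypothetical helper name `censor_terms_escaped` and a made-up term "v2.0" that is not from the question:

```python
import re

def censor_terms_escaped(text, terms, rep_str):
    # Hypothetical variant of censor_email_two: re.escape makes each term
    # match literally, so the '.' in "v2.0" only matches a real dot.
    for t in terms:
        pat = r'\b%s\b' % re.escape(t)
        text = re.sub(pat, rep_str, text)
    return text

sample = "Ship v2.0 today, not v2x0."  # made-up sample text
escaped = censor_terms_escaped(sample, ["v2.0"], "CENSORED")
# Without escaping, '.' matches any character, so "v2x0" is censored too.
unescaped = re.sub(r'\bv2.0\b', "CENSORED", sample)

print(escaped)    # Ship CENSORED today, not v2x0.
print(unescaped)  # Ship CENSORED today, not CENSORED.
```

For the list in the question both versions should behave the same; the escaping only matters once a term contains characters like `.`, `*` or `(`.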

EDIT: Added the rep_str feature

EDIT 2: Some further explanation on the regex.

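So the r prefix indicates a raw string: backslashes are left as they are, so \b reaches the regex engine as a word boundary instead of being turned into a backspace character by Python.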

Then \b is looking for a word boundary - from the docs:

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
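The examples from that docs passage can be checked directly. This is just a small sketch; the names `pattern`, `samples` and `results` are mine:

```python
import re

pattern = re.compile(r'\bfoo\b')
samples = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']

# True where 'foo' appears as a whole word, False where it is glued
# to another word character (a letter, digit or underscore).
results = {s: bool(pattern.search(s)) for s in samples}

for s, matched in results.items():
    print(repr(s), matched)
```

The first four samples match and the last two do not, which is the same reason `\bher\b` leaves 'researchers' and 'herself' alone.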

The %s is string formatting and is replaced by t which is each term in the loop. If you are using Python 3.6 or later this could be replaced by combining f string notation with r raw string: fr'\b{t}\b'.
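To see that the two spellings build the same pattern, and what goes wrong without the r prefix, here is a short sketch (the variable names are only for illustration, and the f-string line needs Python 3.6 or later):

```python
t = "her"

pat_percent = r'\b%s\b' % t   # old-style formatting on a raw string
pat_fstring = fr'\b{t}\b'     # raw f-string, Python 3.6+
pat_not_raw = '\b%s\b' % t    # no r prefix: \b becomes a backspace character

print(pat_percent == pat_fstring)  # True
print(repr(pat_percent))           # '\\bher\\b'
print(repr(pat_not_raw))           # '\x08her\x08'
```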

I think you could use the .format() syntax too, e.g. r'\b{}\b'.format(t). The raw string doesn't get in the way of it; I just find the old % style shorter here.
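As for why the original loop cut 'her' out of 'researchers': as far as I can tell from the posted code, the `if item in term[i]` test passes when the word 'her' itself comes up, and then `result.split(term[i])` removes every occurrence of 'her' from the whole text, including the ones inside longer words. A small sketch with a made-up sentence (not the real `email_two`):

```python
import re

text = "she told our researchers about herself and her work"
term = "her"

# The check passes for the standalone word "her" ...
words = text.split()
check_passes = any(item in term for item in words)

# ... but the removal works on the whole text by substring,
# so "researchers" and "herself" lose their "her" as well.
stripped = ''.join(text.split(term))

# With word boundaries only the standalone "her" is removed.
bounded = re.sub(r'\bher\b', '', text)

print(check_passes)
print(stripped)
print(bounded)
```

So the check and the removal are looking at different things: the check is per word, the removal is per substring of the full text.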

@PyPingu can you explain what `pat = r'\b%s\b' % t` means? I haven't learnt regex yet. Thanks btw. — Sharon Ng, Jul 22 '19 at 14:17
