
I need to remove a given vector of words from a sentence (a given String) in Python.

The problem is that I want to remove exact words only, not substrings or subwords.

Note: I cannot assume that there is a space before or after the word.

I tried the .replace(word, "") method, but it does not work.

Example: s = "I'am at home and i will work by webcam call"

When I call s.replace("am", ""), the output is:

I' at home and i will work by webc call

Could tokenization help?

Chris
  • Possible duplicate of [Removing list of words from a string](https://stackoverflow.com/questions/25346058/removing-list-of-words-from-a-string) – ayorgo Jun 24 '19 at 08:56

2 Answers


You can use re.sub with a regular expression that wraps the word in \b word-boundary anchors:

>>> import re
>>> s = "I'am at home and i will work by webcam call"
>>> re.sub(r"\bam\b", "", s)
"I' at home and i will work by webcam call"

With a list of words, you can either loop over them or join them into a single alternation with |, e.g. "am|and|i". Optionally pass the re.I flag to ignore upper/lowercase:

>>> words = ["am", "and", "i"]
>>> re.sub(r"\b(%s)\b" % "|".join(words), "", s, flags=re.I)
"' at home   will work by webcam call"
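If the words may contain regex metacharacters, or the leftover double spaces are unwanted, the same idea can be wrapped in a small helper. This is only a sketch: the name remove_words and the ignore_case parameter are made up for illustration, and it assumes every word starts and ends with a word character (otherwise \b will not behave as expected).

```python
import re

def remove_words(text, words, ignore_case=True):
    """Remove whole-word occurrences of `words` from `text` (hypothetical helper)."""
    if not words:
        return text
    # Escape each word so characters such as "." or "+" are matched literally.
    alternation = "|".join(re.escape(w) for w in words)
    flags = re.I if ignore_case else 0
    pattern = re.compile(r"\b(?:%s)\b" % alternation, flags)
    stripped = pattern.sub("", text)
    # Collapse the runs of spaces left behind by removed words.
    return re.sub(r" {2,}", " ", stripped).strip()

s = "I'am at home and i will work by webcam call"
print(remove_words(s, ["am", "and", "i"]))
# ' at home will work by webcam call
```

Compiling the pattern once also avoids rebuilding it for every sentence when many strings are processed.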
tobias_k
  • [This answer](https://stackoverflow.com/a/48051693/4755520) to the question mentioned above gives the exact and straightforward solution for the OP's problem that essentially boils down to `re.split('\W+', "Don't answer duplicates, please")`. – ayorgo Jun 24 '19 at 11:59
  • @ayorgo Not really. I scanned the answers on the "dupe", but none really did what OP wanted. E.g. the one you linked to (which indeed does something similar to `\b` with `\W`) will discard any punctuation. – tobias_k Jun 24 '19 at 12:20
  • Tbh it's not completely clear what the OP needs exactly but given the tags [tag:machine-learning], [tag:nlp] and [tag:recurrent-neural-network] it seems that having a list of words for an output would suffice. – ayorgo Jun 24 '19 at 12:31

You could use a list comprehension like so:

sentence_filtered = " ".join([word for word in sentence.split() if word.lower() not in vector_of_words])
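Note that str.split() only separates on whitespace, so a word attached to punctuation (as in I'am) stays in one piece. If a list of tokens is acceptable as output (an assumption, since the question does not say), a rough alternative is to tokenize with a regular expression first and then filter. The function name filter_tokens is hypothetical.

```python
import re

def filter_tokens(sentence, vector_of_words):
    """Tokenize `sentence` and drop tokens listed in `vector_of_words` (sketch)."""
    # Lowercase the words to drop so the comparison is case-insensitive.
    drop = {w.lower() for w in vector_of_words}
    # \w+ matches a run of word characters; [^\w\s] matches a single
    # punctuation character, so "I'am" becomes "I", "'", "am".
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return [t for t in tokens if t.lower() not in drop]

sentence = "I'am at home and i will work by webcam call"
print(filter_tokens(sentence, ["am", "and", "i"]))
# ["'", 'at', 'home', 'will', 'work', 'by', 'webcam', 'call']
```

Unlike the one-liner above, this keeps punctuation as separate tokens, so the original spacing cannot be recovered exactly by joining with " ".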
Wytamma Wirth
  • "note: i cannot assume that before or after the word there is a space" – tobias_k Jun 24 '19 at 09:24
  • Before OR after. I assumed that meant the starting (no space before) or ending (no space after) words. Can the OP please clarify? – Wytamma Wirth Jun 24 '19 at 09:28
  • How does that make a difference? The point is that `split` won't split `am` from `i'am`, and thus it will not work on the provided example. – tobias_k Jun 24 '19 at 09:32
  • "i want to remove exactly words but not substrings or subwords." `am` is a substring of `i'am`. Would be more helpful if the OP supplied 'expected results'. – Wytamma Wirth Jun 24 '19 at 09:35
  • The `i'am` might be a bad example, as you could argue whether that's one word or two, but I think it's pretty clear that OP also wants to remove words if they are e.g. followed by punctuation, like commas or quotes. – tobias_k Jun 24 '19 at 09:40