
I'm writing a Python program that reads the provided .txt files and removes all numbers, commas, and certain words. It's for transcribing phone calls, so the words are fillers like "um" and "uh". The result is written to a new text file that contains everything except the removed data.

The code works, but it also strips those fillers out of longer words that contain them. For example, "momentum" becomes "moment" because it ends in "um". Here is the code:

infile = "testfile.txt"
outfile = "cleanedfile.txt"
numbers = [1,2,3,4,5,6,7,8,9]
deleteList = [",", "Um", "um", "Uh", "uh", str(numbers)]
fin = open(infile)
fout = open(outfile, 'w+')
for line in fin:
    for word in deleteList:
        line = line.replace(word, "")
    fout.write(line)
fin.close()
fout.close()

Any help would be greatly appreciated.

edToms
  • You're going to want to use regex here instead of `line.replace`. A regex of form ` um ` would match only individual words since the 'um' is surrounded by spaces. The [documentation](https://docs.python.org/3/howto/regex.html) should explain how. – MCBama Dec 06 '17 at 17:06
  • Does checking for a space or the beginning/end of the line before and after the word work? Like checking for `" Um "` instead of `"Um"`? There are also regex anchors that let you check whether the start of the line comes right before the word or the end of the line comes right after it, since there are no spaces to match in those cases. – Davy M Dec 06 '17 at 17:06
  • Also, do you want to get rid of words that are directly followed by punctuation? Like `"This is um, mine."` Should just um be removed `"This is , mine."` or the punctuation too: `"This is mine."` ? – Davy M Dec 06 '17 at 17:08
  • You can use `\b` to regex-match word boundaries, which will allow you to match a word 'foo' with a regex like `r"\bfoo\b"`, which would match the foo in "this is foo barred" but it wouldn't match "I totally foobarred it". See other word-boundary questions e.g. https://stackoverflow.com/questions/3995034/does-python-re-module-support-word-boundaries-b – Tom Dalton Dec 06 '17 at 17:10
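
To illustrate the two approaches raised in the comments, here is a minimal sketch comparing a space-padded literal match with a `\b` word-boundary regex. The sample sentence is made up for illustration and is not from the original transcript files.

```python
import re

# Made-up sample line: fillers at the start, before a comma, and at the end.
text = "Um, the momentum is um, building. I totally foobarred it, uh"

# Space-padded literal: only catches " um " with a space on both sides,
# so it misses fillers next to punctuation or at the line edges.
padded = text.replace(" um ", " ")

# Word-boundary regex: matches "Um"/"um"/"Uh"/"uh" as whole words anywhere,
# but not the "um" inside "momentum".
bounded = re.sub(r"\b[Uu][mh]\b", "", text)

print(padded)
print(bounded)
```

In this sample the space-padded version leaves the line unchanged, while the word-boundary version removes all three fillers and keeps "momentum" intact. The commas next to the removed fillers are left behind and would need a separate step.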

1 Answer


I've solved it using regex, changing the code to look like this:

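import re

infile = "testfile.txt"
outfile = "cleanedfile.txt"
# \b anchors the match at word boundaries, so "momentum" is left intact.
with open(infile) as fin, open(outfile, "w") as fout:
    for line in fin:
        line = re.sub(r"\b(U|u)(m|h)\b", "", line)
        fout.write(line)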

Thanks everyone for their help.
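
For completeness, here is a fuller sketch that also handles the commas and digits mentioned in the question, since the regex loop above only removes the fillers. Note that `str(numbers)` in the question produces the single string `"[1, 2, 3, 4, 5, 6, 7, 8, 9]"`, which never appears in a transcript, so digits were probably not being removed there either. The names `FILLERS`, `clean_line`, and `clean_file` are hypothetical, and collapsing leftover spaces is an assumption about the desired output rather than something the question asked for.

```python
import re

# Hypothetical filler list; extend it with other words as needed.
FILLERS = ["um", "uh"]

# Whole-word fillers only, case-insensitive ("Um", "UH", ...).
FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in FILLERS) + r")\b",
    re.IGNORECASE,
)
# Any ASCII digit (including 0) or comma.
DIGIT_COMMA_RE = re.compile(r"[0-9,]")
# Runs of two or more spaces/tabs left behind by the removals.
SPACES_RE = re.compile(r"[ \t]{2,}")


def clean_line(line):
    """Remove fillers, digits, and commas, then collapse repeated spaces."""
    line = FILLER_RE.sub("", line)
    line = DIGIT_COMMA_RE.sub("", line)
    return SPACES_RE.sub(" ", line)


def clean_file(infile, outfile):
    """Write a cleaned copy of infile to outfile, line by line."""
    with open(infile) as fin, open(outfile, "w") as fout:
        for line in fin:
            fout.write(clean_line(line))
```

Calling `clean_file("testfile.txt", "cleanedfile.txt")` would mirror the file names used in the question. A single leading space can remain where a line started with a filler; calling `lstrip()` on the cleaned line is one way to drop it if indentation does not matter.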

scrpy
edToms