
We have several texts (strings) that contain annotations that are not part of the spoken words, such as [inaudible] and [Laughter]. We want to delete these annotations from our strings. They always have the same structure and are written in square brackets: [...]. Example:

text="I think I could pretty much say, Mike, most of them have become stars, if not all. Because you won. Winning is a wonderful thing. [Laughter] So I thought what I'd do is go around the room"

This is what we have tried so far:

text2=re.sub('[.*]', '', text)

or

text2=re.sub('/[.*/]', '', text)

If the text contains two or more of these elements ([inaudible] and so on), all the text between them is deleted as well. That should not happen, and we don't know how to avoid it. The first example sometimes deletes . and sometimes it doesn't, which is confusing as well. We are Python beginners :)

maomii

1 Answer


You are using the greedy version of the repetition operator (*). A greedy operator matches the longest possible string, so with several bracketed elements the match runs from the first [ to the last ] and swallows everything in between. There is also a non-greedy operator, *?, which matches the shortest possible string. Greed is good, but sometimes non-greedy is better. In my personal experience I use the non-greedy operator more often than the greedy one.
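To make the difference concrete, here is a small sketch that runs both operators on a made-up string (the `sample` value is an assumption for illustration, not the text from the question):

```python
import re

# Hypothetical sample with two bracketed annotations.
sample = "one [inaudible] two [Laughter] three"

# Greedy: .* runs from the first "[" to the last "]",
# so " two " in the middle is removed as well.
greedy = re.sub(r'\[.*\]', '', sample)

# Non-greedy: .*? stops at the first "]" it can,
# so each annotation is removed on its own.
lazy = re.sub(r'\[.*?\]', '', sample)

print(repr(greedy))  # 'one  three'
print(repr(lazy))    # 'one  two  three'
```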

Try this:

text2=re.sub(r'\[.*?\]', '', text)
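Removing an annotation leaves the spaces on both sides of it behind. If that matters, a second substitution can collapse them. This is only a sketch: the helper name `strip_annotations` is made up here, and it assumes that runs of spaces should always shrink to a single space.

```python
import re

def strip_annotations(text):
    """Remove [bracketed] annotations and tidy the leftover spaces.

    Hypothetical helper for illustration; not part of the re module.
    """
    # Drop every [ ... ] element, matching as little as possible each time.
    without = re.sub(r'\[.*?\]', '', text)
    # Collapse runs of two or more spaces into one and trim the ends.
    return re.sub(r' {2,}', ' ', without).strip()

cleaned = strip_annotations(
    "Winning is a wonderful thing. [Laughter] So I thought [inaudible] what"
)
print(cleaned)  # Winning is a wonderful thing. So I thought what
```

Note that . does not match a newline by default, so an annotation that spans several lines would be left in place unless re.DOTALL is passed.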

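Also, compared to your version, I changed your forward slashes to backslashes, because the backslash is what escapes special characters such as [ and ], and I used a raw string r'string' to prevent conflicts between Python backslashes and regular expression backslashes. Without the escapes, [.*] is a character class that matches a single . or *, which is why your first attempt deleted periods instead of the bracketed elements.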
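A short sketch of what the two patterns from the question actually match may help here. The input strings are invented for the demonstration:

```python
import re

# '[.*]' is a character class: it matches ONE character that is
# either "." or "*". The brackets in the text are never touched.
first = re.sub('[.*]', '', "a.b*c [x]")

# '/[.*/]' matches a literal "/" followed by one of ".", "*" or "/".
# Again, "[x]" in the text is left alone.
second = re.sub('/[.*/]', '', "a/.b [x] c//d")

# Escaping the brackets makes them literal characters.
fixed = re.sub(r'\[.*?\]', '', "a.b*c [x]")

print(repr(first))   # 'abc [x]'
print(repr(second))  # 'ab [x] cd'
print(repr(fixed))   # 'a.b*c '
```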

There is an excellent tutorial on regular expressions by A.M. Kuchling: https://docs.python.org/2/howto/regex.html. All three changes (the non-greedy operator, the backslash escapes, and the raw string) are explained in more detail there.

Hans Then