I am creating a Jupyter notebook to clean a large number of novels with regex code that I am testing in Sublime. Many of my texts contain the phrase 'Digitized by Google', because that is where I got the PDFs that I ran through Optical Character Recognition. I want to remove all lines that contain the phrase 'Digitized', or rather 'gitized', since the first part isn't always transcribed correctly.

When I use this pattern in Sublime's 'Replace' function, I get exactly the results I want:

^.*igitized.*$

However, when I use the re.sub method in my Jupyter notebook, which works for some other phrases, the 'Digitized by Google' lines are not matched, so they are not replaced with an empty string.

text = re.sub(r'^.*igitized.*$', '', text)

What am I missing?

Maartje
  • The regex seems fine. Do all occurrences of `Digitized...` start at the beginning of a line? – Mohit Solanki Apr 18 '19 at 19:17
  • Have you tried using non-greedy quantifiers? I would imagine the beginning of your regex string (^.*) is greedy by default and consumes everything following it. Can you try changing your string to ```r'^.*?igitized.*?$'```? The question mark tells the regex engine that the preceding quantifier is non-greedy and should match as few characters as possible, so it will stop consuming characters once igitized is found. – SyntaxVoid Apr 18 '19 at 19:20
  • Are you running the regex against one line at a time or against the entire file? As written, this regex will only work if you run it against one line at a time; you can't feed it an entire file. – Gillespie Apr 18 '19 at 19:22
  • This might be a duplicate question; see https://stackoverflow.com/questions/31400362/using-to-match-beginning-of-line-in-python-regex – Serge Apr 18 '19 at 19:26

1 Answer

By default, '^' matches only at the beginning of the string, and '$' matches only at the end of the string or immediately before a newline at the end of the string. Add the re.MULTILINE flag so that '^' and '$' also match at the beginning and end of each line.

text = re.sub(r'^.*igitized.*$', '', text, flags=re.MULTILINE)
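
To see the difference the flag makes, here is a minimal sketch. The sample string is a hypothetical stand-in for the OCR output, not text from the actual novels, and the last two variants assume that the blank lines left behind are unwanted.

```python
import re

# Hypothetical sample standing in for OCR output (not from the real novels).
sample = (
    "Chapter one begins here.\n"
    "Digitized by Google\n"
    "The story continues.\n"
    "digitized by Googlc\n"
    "The end."
)

pattern = r'^.*igitized.*$'

# Without re.MULTILINE, '^' matches only at the very start of the string
# and '.' does not cross newlines, so nothing is replaced.
unchanged = re.sub(pattern, '', sample)

# With re.MULTILINE, '^' and '$' match at every line boundary.
# The matched text is removed, but the newline stays, leaving blank lines.
blanked = re.sub(pattern, '', sample, flags=re.MULTILINE)

# Assumption: blank lines are unwanted. This variant also consumes the
# trailing newline (or stops at the end of the string on the last line).
removed = re.sub(r'^.*igitized.*(?:\n|$)', '', sample, flags=re.MULTILINE)

# Alternative without line anchors: filter the text line by line.
filtered = "\n".join(
    line for line in sample.splitlines() if "igitized" not in line
)

print(blanked)
print(removed)
```

On this sample, the variant that consumes the newline and the line filter give the same result. The filter may be easier to read if the condition for dropping a line grows beyond a single phrase.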

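See also: Using ^ to match beginning of line in Python regex (https://stackoverflow.com/questions/31400362/using-to-match-beginning-of-line-in-python-regex)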

Serge