Regex works in Sublime, not in Python (Jupyter)

Question

I am creating a Jupyter notebook to clean a large number of novels with regex code that I am testing in Sublime. Many of my texts contain the phrase 'Digitized by Google' because that is where I got the PDFs that I ran through Optical Character Recognition. I want to remove all lines that contain the phrase 'Digitized', or rather 'igitized', since the first part isn't always correctly transcribed.

When I use this pattern in Sublime's 'Replace' function, I get exactly the results I want:

^.*igitized.*$

However, when I use the re.sub method in my Jupyter notebook, which works for some other phrases, the 'Digitized by Google' lines are NOT identified and replaced with nothing.

text = re.sub(r'^.*igitized.*$', '', text)

What am I missing?

The regex seems fine. Do all occurrences of `Digitized...` start at the beginning of a line? – Mohit Solanki, Apr 18 '19 at 19:17
Have you tried using non-greedy quantifiers? I would imagine the beginning of your regex string, (^.*), would be greedy by default and consume everything following it. Can you try changing your string to `r'^.*?igitized.*?$'`? The question mark tells regex that the previous quantifier is non-greedy and should match as few characters as possible, so it will stop consuming characters once igitized is found. – SyntaxVoid, Apr 18 '19 at 19:20
Are you running the regex against one line at a time or against the entire file? This regex will only work if you run it against one line at a time; you can't feed it an entire file. – Gillespie, Apr 18 '19 at 19:22
This might be a duplicate question; see https://stackoverflow.com/questions/31400362/using-to-match-beginning-of-line-in-python-regex – Serge, Apr 18 '19 at 19:26
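
The line-at-a-time approach raised in the comments could be sketched roughly as follows. The `text` value here is a made-up sample, not the asker's data, and the variable names are only illustrative.

```python
import re

# Hypothetical sample standing in for one OCR'd novel.
text = "Chapter One\nDigitized by Google\nIt was a dark night."

# Applied to a single line, ^ and $ already line up with the line's
# start and end, so no extra flag is needed.
line_pattern = re.compile(r'^.*igitized.*$')

# Keep only the lines that do not match, then rejoin them.
kept = [line for line in text.splitlines() if not line_pattern.search(line)]
cleaned = "\n".join(kept)

print(cleaned)
```

Note that `splitlines()` followed by `"\n".join(...)` drops a trailing newline if the text had one, which may or may not matter for the cleaning task.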

Serge · Answer 1 · 2019-04-18T20:21:13.627

By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string. Add the re.MULTILINE flag so that they also match at the beginning and end of each line:

text = re.sub(r'^.*igitized.*$', '', text, flags=re.MULTILINE)
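
A minimal sketch of the difference, assuming a made-up `sample` string (the asker's actual texts are not shown). The last variant, which also consumes the trailing newline, is an optional extra and not part of the original answer.

```python
import re

# Hypothetical OCR output with two unwanted lines.
sample = (
    "Chapter One\n"
    "Digitized by Google\n"
    "It was a dark night.\n"
    "Page 2 Digitized by Google\n"
    "The end."
)

pattern = r'^.*igitized.*$'

# Without re.MULTILINE, ^ matches only at the very start of the string,
# and . does not cross newlines, so nothing is replaced here.
unchanged = re.sub(pattern, '', sample)

# With re.MULTILINE, ^ and $ match at every line boundary.
cleaned = re.sub(pattern, '', sample, flags=re.MULTILINE)

# The newline after each removed line is left behind as an empty line.
# Optionally consuming it (\n?) removes the line entirely.
compact = re.sub(r'^.*igitized.*\n?', '', sample, flags=re.MULTILINE)

print(repr(cleaned))
print(repr(compact))
```

Passing the flag by keyword matters: the fourth positional parameter of `re.sub` is `count`, not `flags`.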

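See also: Using ^ to match beginning of line in Python regex (https://stackoverflow.com/questions/31400362/using-to-match-beginning-of-line-in-python-regex)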

edited Apr 18 '19 at 20:21

answered Apr 18 '19 at 19:30

Hi Serge, re.sub does not accept 'flag' as an addition, but your explanation of ^ and $ worked like a charm once I removed them. Much appreciated! – Maartje Apr 18 '19 at 20:15
it is actually flags – Serge Apr 18 '19 at 20:21
