pull numbers and letters together python regex

Question

I am looking to delete any text from a string in Python that matches something along the lines of "\nPage 10 of 12\n", where 10 and 12 are different numbers each time (I am looping through 300+ documents that all have different page lengths). Below is an example of some text that is in my string, followed by the output I would want:

thisisaboutthen\n\n\nPage 2 of 12\n\nnowwearegoing\n\nPage 3 of 12\n\n\n\

Output -> thisisaboutthennowwearegoing

I am trying the code:

page = r'\nPage \b\d+\b of \b\d+\b\n+'
return re.sub(page, '', string)

But I can't get it to work. I tried referring to the question "Python: Extract numbers from a string" for help, but I can't seem to combine numbers and letters in one pattern.

I'm new to regex in Python, and any help would be great. I have been able to get regex to work when the pattern is just letters or just numbers, but I run into problems when combining them.

Thanks in advance

All you need is a `+` before `Page`: `\n+Page \b\d+\b of \b\d+\b\n+` — Aran-Fey, Jan 08 '18 at 22:48
Your pattern looks good. However the word-boundaries are useless, you can remove them. Show more code. Also, are you sure that newline sequences are only `\n` and not `\r\n` or `\r` or something more exotic? How many newlines do you want to keep? — Casimir et Hippolyte, Jan 08 '18 at 23:51
Are you sure that `string` contains the whole text and not only a single line? (because obviously a single line can't contain two or more newline sequences as your pattern describes it). In this case, I repeat my previous advice: Show more code. — Casimir et Hippolyte, Jan 08 '18 at 23:58
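
The first comment's suggestion can be checked against the sample from the question. This is a minimal sketch, assuming the text really uses plain `\n` newlines as shown; the variable names are illustrative only.

```python
import re

text = 'thisisaboutthen\n\n\nPage 2 of 12\n\nnowwearegoing\n\nPage 3 of 12\n\n\n'

# Pattern from the question: only ONE newline before "Page" is consumed,
# so any extra leading newlines are left behind in the result.
original = r'\nPage \b\d+\b of \b\d+\b\n+'

# Suggested fix from the comment: allow one or more newlines before "Page".
fixed = r'\n+Page \b\d+\b of \b\d+\b\n+'

original_result = re.sub(original, '', text)
fixed_result = re.sub(fixed, '', text)

print(repr(original_result))  # stray newlines remain
print(repr(fixed_result))     # all page markers and newlines removed
```

On this sample the question's pattern does match, it just leaves some newlines in place, which may be what "can't get it to work" refers to.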

Jan · Accepted Answer · 2018-01-08T23:01:26.437


One way might be

import re

string = """thisisaboutthen


Page 2 of 12

nowwearegoing

Page 3 of 12



"""

string = re.sub(r'\s*Page \d+ of \d+\s*', '', string)
print(string)

Which yields

thisisaboutthennowwearegoing

See a demo on regex101.com.
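
Since the question mentions looping over 300+ documents, the same pattern could be compiled once and wrapped in a helper. The sketch below is only one way to arrange it: the function name `strip_page_markers` and the sample `documents` list are hypothetical, not from the question.

```python
import re

# Compile once, since the same pattern is reused for every document.
PAGE_MARKER = re.compile(r'\s*Page \d+ of \d+\s*')


def strip_page_markers(text):
    """Remove 'Page N of M' markers along with any surrounding whitespace."""
    return PAGE_MARKER.sub('', text)


# Hypothetical sample documents; the second one uses \r\n line endings.
documents = [
    'thisisaboutthen\n\n\nPage 2 of 12\n\nnowwearegoing\n\nPage 3 of 12\n\n\n',
    'Page 1 of 3\r\nfirst\r\nPage 2 of 3\r\nsecond',
]

cleaned = [strip_page_markers(doc) for doc in documents]
print(cleaned)
```

Because `\s` also matches `\r`, this version does not depend on which newline convention the documents use. Note that it joins the text on either side of a marker with no separator, which matches the output requested in the question.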

edited Jan 08 '18 at 23:01

answered Jan 08 '18 at 22:54

Jan



Thank you Jan, this worked perfectly. Also thanks for the helpful link, it is much easier than testing in Python – eluth Jan 09 '18 at 01:48

score 0 · Answer 2 · answered Jan 08 '18 at 23:09


I'm not sure about the context, but instead of specifying line breaks (\n) and spaces you can use \s. The + quantifier means "one or more" of the preceding item.

import re
string = 'thisisaboutthen\n\n\nPage 2 of 12\n\nnowwearegoing\n\nPage 3 of 12\n\n\n'
pattern = r'\s+Page\s+\d+\s+of\s+\d+\s+'
print(re.sub(pattern, '', string))

\d matches digits and \s matches whitespace characters (space, \t, \n, \r, \f, \v). It may also be useful to pass re.IGNORECASE.
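
The thread does not establish why this pattern failed on the asker's real data. One possibility, offered purely as an assumption, is that some markers sit at the very start or end of a document, where `\s+` has nothing to match but the `\s*` used in the accepted answer still succeeds. The sketch below uses a made-up input to show that difference, along with the effect of `re.IGNORECASE`.

```python
import re

strict = r'\s+Page\s+\d+\s+of\s+\d+\s+'   # whitespace required on both sides
lenient = r'\s*Page\s+\d+\s+of\s+\d+\s*'  # surrounding whitespace optional

# Hypothetical input: one marker at the very start, one in lowercase,
# and one at the very end with nothing after the last digit.
text = 'Page 1 of 3\nintro\n\npage 2 of 3\nbody\nPage 3 of 3'

strict_result = re.sub(strict, '', text)
lenient_result = re.sub(lenient, '', text)
ignorecase_result = re.sub(lenient, '', text, flags=re.IGNORECASE)

print(repr(strict_result))      # nothing removed
print(repr(lenient_result))     # first and last markers removed
print(repr(ignorecase_result))  # lowercase marker removed as well
```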


Schcriher


For some reason this still didn't work, but Jan's answer above did, with * instead of +. Thanks for explaining the use of \s though, that will be helpful – eluth Jan 09 '18 at 01:49
