I want to delete any text from a string in Python that matches something like "\nPage 10 of 12\n", where the two numbers vary (I am looping through 300+ documents that all have different page counts). Below is an example of the text in my string, followed by the output I want:

thisisaboutthen\n\n\nPage 2 of 12\n\nnowwearegoing\n\nPage 3 of 12\n\n\n\

Output -> thisisaboutthennowwearegoing

I am trying the code:

page = r'\nPage \b\d+\b of \b\d+\b\n+'
return re.sub(page, '', string)

But I can't get it to work. I tried to follow the question "Python: Extract numbers from a string" for help, but I can't seem to combine numbers and letters in one pattern.

I'm new to regex in Python and any help would be great. I have been able to get regex to work when the pattern is just letters or just numbers, but I run into problems when combining them.

Thanks in advance

eluth
  • All you need is a `+` before `Page`: `\n+Page \b\d+\b of \b\d+\b\n+` – Aran-Fey Jan 08 '18 at 22:48
  • Your pattern looks good. However the word-boundaries are useless, you can remove them. Show more code. Also, are you sure that newline sequences are only `\n` and not `\r\n` or `\r` or something more exotic? How many newlines do you want to keep? – Casimir et Hippolyte Jan 08 '18 at 23:51
  • Are you sure that `string` contains the whole text and not only a single line? (because obviously a single line can't contain two or more newline sequences as your pattern describes it). In this case, I repeat my previous advice: Show more code. – Casimir et Hippolyte Jan 08 '18 at 23:58
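
The difference between the pattern in the question and the `\n+` variant suggested in the first comment can be checked with a short sketch. It assumes the sample string from the question (without the trailing backslash) and drops the word boundaries, as the second comment suggests; the variable names are only illustrative.

```python
import re

text = 'thisisaboutthen\n\n\nPage 2 of 12\n\nnowwearegoing\n\nPage 3 of 12\n\n\n'

# Pattern from the question: only ONE newline is consumed before "Page",
# so any extra newlines in front of the marker are left behind.
original = r'\nPage \b\d+\b of \b\d+\b\n+'

# Suggested change: \n+ consumes every newline in front of "Page".
# The \b word boundaries are dropped because they add nothing here.
suggested = r'\n+Page \d+ of \d+\n+'

after_original = re.sub(original, '', text)
after_suggested = re.sub(suggested, '', text)

print(repr(after_original))   # 'thisisaboutthen\n\nnowwearegoing\n'
print(repr(after_suggested))  # 'thisisaboutthennowwearegoing'
```

So the original pattern does match; it just leaves the leading newlines in place, which can look like "not working" when the result is printed.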

2 Answers

One way might be

import re

string = """thisisaboutthen


Page 2 of 12

nowwearegoing

Page 3 of 12



"""

string = re.sub(r'\s*Page \d+ of \d+\s*', '', string)
print(string)

Which yields

thisisaboutthennowwearegoing

See a demo on regex101.com.
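
Since the question mentions looping through 300+ documents, here is a minimal sketch of how the same pattern might be reused across them. The helper name `strip_page_markers` and the sample `documents` list are hypothetical; they only stand in for however the real documents are loaded.

```python
import re

# Compile once and reuse the same pattern object for every document.
PAGE_MARKER = re.compile(r'\s*Page \d+ of \d+\s*')


def strip_page_markers(text):
    """Remove every 'Page X of Y' marker together with surrounding whitespace."""
    return PAGE_MARKER.sub('', text)


# Hypothetical sample documents standing in for the real ones.
documents = [
    'thisisaboutthen\n\n\nPage 2 of 12\n\nnowwearegoing\n\nPage 3 of 12\n\n\n',
    'intro\nPage 1 of 300\nbody',
]

cleaned = [strip_page_markers(doc) for doc in documents]
print(cleaned)  # ['thisisaboutthennowwearegoing', 'introbody']
```

Note that replacing with an empty string glues the surrounding text together, as requested in the question. Passing `' '` or `'\n'` as the replacement instead would keep a separator between the parts.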

Jan
  • Thank you Jan this worked perfectly. Also thanks for the helpful link - much easier than testing in python – eluth Jan 09 '18 at 01:48
I'm not sure about the context, but instead of spelling out line breaks (\n) and spaces you can use \s, which matches any whitespace character. The + quantifier means "one or more" of the preceding token.

import re
string = 'thisisaboutthen\n\n\nPage 2 of 12\n\nnowwearegoing\n\nPage 3 of 12\n\n\n'
pattern = r'\s+Page\s+\d+\s+of\s+\d+\s+'
print(re.sub(pattern, '', string))

\d matches digits and \s matches whitespace characters (space, \t, \n, \r, \f and \v). re.IGNORECASE may also be useful if the capitalization of "Page" or "of" varies between documents.
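
As a rough illustration of why \s and re.IGNORECASE can help, the sketch below uses a made-up input with Windows line endings (\r\n), a tab, and inconsistent capitalization. The input string is an assumption for demonstration only, not text from the actual documents.

```python
import re

# Hypothetical input: \r\n line endings, a tab, and mixed-case markers.
text = 'first part\r\n\r\nPAGE 2 of 12\r\n\r\nsecond part\n\tpage 3 OF 12 \nthird part'

# \s matches \r, \n, \t and spaces alike; IGNORECASE covers PAGE/page/OF.
pattern = re.compile(r'\s+Page\s+\d+\s+of\s+\d+\s+', re.IGNORECASE)

result = pattern.sub('', text)
print(result)  # first partsecond partthird part
```

A pattern built from literal \n would miss the \r\n variants here, whereas \s covers them without listing each case.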

Schcriher
  • For some reason this still didn't work, but Jan's answer above did, with * instead of +. Thanks for explaining the use of \s though, that will be helpful – eluth Jan 09 '18 at 01:49
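
One possible explanation for the difference reported in the last comment (this is an assumption, since the actual documents are not shown): \s+ requires at least one whitespace character on each side of the marker, so a marker at the very start or very end of the text is skipped, while \s* also accepts zero. A small sketch with a made-up string:

```python
import re

# Hypothetical text: one marker at the very start, one at the very end,
# so neither has whitespace on both sides.
text = 'Page 1 of 12\n\nopening paragraph\n\nPage 2 of 12'

# \s+ needs at least one whitespace character before AND after the marker.
with_plus = re.sub(r'\s+Page\s+\d+\s+of\s+\d+\s+', '', text)

# \s* also matches when there is no whitespace on one side.
with_star = re.sub(r'\s*Page \d+ of \d+\s*', '', text)

print(repr(with_plus))  # unchanged: neither marker is matched
print(repr(with_star))  # 'opening paragraph'
```

If the real documents begin or end with a page marker, that alone would account for \s* working where \s+ did not.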