0

I have many strings that display page numbers in one of two possible formats: (pp. 4500-4503) or just 4500-4503. There may also be cases with only one page, so (pp. 113) or just 11.

Some examples of strings:

- Mitchell, J.A. (2017). Citation: Why is it so important. Mendeley Journal, 67(2), (pp. 81-95). 

- Denhart, H. (2008). Deconstructing barriers: Perceptions of students labeled with learning disabilities in higher education. Journal of Learning Disabilities, 41, 483-497.

I'm using this regex for the first format:

r"pp\. \d+-\d+"

And this for the second one:

r"\d+-\d+"

Neither of them is working. I was also wondering: is there a way to use a single regex instead of creating two? Thank you

Diana Mele
  • Your patterns seem to be working. To create a single pattern, you can use an optional non capture group `(?:pp\. )?\d+-\d+` https://regex101.com/r/KOokjR/1 – The fourth bird Apr 20 '22 at 13:45
  • Neither works, that is strange. Are you sure the hyphen is a regular hyphen? Try `re.findall(r"(?:\bpp\.\s*)?\d+[—–-]\d+", text)`. Note that `\s` matches *any* Unicode whitespace in Python `re` and the `[—–-]` pattern will match an en-dash, an em-dash and the hyphen. Here you can find all possible [Unicode dash patterns](https://stackoverflow.com/a/48923796/3832970). – Wiktor Stribiżew Apr 20 '22 at 13:47
  • Thank you! It works for the `(pp. number-number)` format but not for the `number-number` one. Also, I've changed it to `r"\((?:pp\. )\)?\d+-\d+"` to include the "()", but I don't know why I'm only getting `(pp. number-number` without the closing `)`. It also occurred to me that there may be cases where there is only a single number because there is only one page. How could I change the code for that case? Thanks again – Diana Mele Apr 20 '22 at 13:57
  • `re.findall(r"(?:\(?\bpp\.\s*)?\d+(?:[—–-]\d+)?\)?", text)`? – Wiktor Stribiżew Apr 20 '22 at 14:09
  • You can match both like this `\(pp\. \d+(?:-\d+)?\)|\b\d+-\d+\b` https://regex101.com/r/h53rOT/1 – The fourth bird Apr 20 '22 at 14:11
  • `or just 11` How do you know that 11 or a single number is a page number? – The fourth bird Apr 20 '22 at 14:20
  • Thank you all! `r"(?:\bpp\.\s*)?\d+[—–-]\d+"` is working perfectly for the `(pp. number-number)` format, but I'm having a problem with the `number-number` one, since it also returns numbers inside a URL. Usually the page numbers follow a comma and are followed by a dot (like this: `, number-number.`). How can I change the code to account for this? The same goes for when only one page is listed (`, number.`) and for the `(pp. number)` format. Thank you all for your help. This is the first time I'm working with regex and I'm a little bit lost. Thanks again – Diana Mele Apr 20 '22 at 20:47

2 Answers

1

This pattern matches all your different formats:

(\(pp\.)? \d+(-\d+)?\)?

https://regex101.com/r/HV7rlJ/2
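
As a rough check of how this pattern behaves in Python, here is a minimal sketch run against shortened versions of the two sample strings from the question. The helper name `full_matches` is made up for illustration. Because the pattern contains capture groups, `re.findall` would return tuples of groups, so the sketch uses `re.finditer` and `group(0)` to get the whole match.

```python
import re

# Pattern from the answer above; note the literal space before \d+.
pattern = re.compile(r"(\(pp\.)? \d+(-\d+)?\)?")

def full_matches(text):
    # finditer + group(0) yields the whole match; findall would return
    # tuples of the two capture groups instead.
    return [m.group(0) for m in pattern.finditer(text)]

# Shortened versions of the sample strings from the question.
s1 = "Mendeley Journal, 67(2), (pp. 81-95)."
s2 = "Journal of Learning Disabilities, 41, 483-497."

print(full_matches(s1))  # [' 67', '(pp. 81-95)']
print(full_matches(s2))  # [' 41', ' 483-497']
```

On these inputs the pattern finds the page ranges, but it also appears to pick up volume numbers such as 67 and 41, and bare matches include the leading space, so some post-filtering may be needed.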

1

You might use:

\(pp\.\s+\d+(?:-\d+)?\)|\b\d+(?:-\d+)?(?=(?:\s*,\s*\d+(?:-\d+)?)*\.)

Explanation

  • \(pp\.\s+\d+(?:-\d+)?\) Match (pp. followed by 1+ whitespace chars, 1+ digits, optionally - and 1+ digits, and then )
  • | Or
  • \b A word boundary
  • \d+(?:-\d+)? Match 1+ digits and optionally - and 1+ digits
  • (?= Positive lookahead, assert what is to the right is
    • (?: Non capture group to repeat as a whole part
      • \s*,\s* Match a comma between optional whitespace chars
      • \d+(?:-\d+)? Match 1+ digits and optionally - and 1+ digits
    • )* Close the non capture group and optionally repeat it
    • \. Match a dot
  • ) Close lookahead
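
The breakdown above can also be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace inside the pattern and allows `#` comments. This is only a sketch of the same pattern in a different layout; the names `PAGE_PATTERN` and `sample` are made up for illustration.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so each part is annotated.
PAGE_PATTERN = re.compile(
    r"""
    \(pp\.\s+\d+(?:-\d+)?\)       # "(pp. 81-95)" or "(pp. 113)"
    |                             # or
    \b\d+(?:-\d+)?                # a bare number or range: "41", "483-497"
    (?=                           # lookahead: what follows must be
        (?:\s*,\s*\d+(?:-\d+)?)*  #   zero or more ", number[-number]" parts
        \.                        #   and then a literal dot
    )
    """,
    re.VERBOSE,
)

sample = "Journal of Learning Disabilities, 41, 483-497."
print(PAGE_PATTERN.findall(sample))  # ['41', '483-497']
```

Since the pattern has no capture groups, `findall` returns the full matches directly.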

See a regex demo and a Python demo.

Example

import re

pattern = r"\(pp\.\s+\d+(?:-\d+)?\)|\b\d+(?:-\d+)?(?=(?:\s*,\s*\d+(?:-\d+)?)*\.)"

s = ("- Mitchell, J.A. (2017). Citation: Why is it so important. Mendeley Journal, 67(2), (pp. 81-95). \n\n"
            "- Denhart, H. (2008). Deconstructing barriers: Perceptions of students labeled with learning disabilities in higher education. Journal of Learning Disabilities, 41, 483-497.")

print(re.findall(pattern, s))

Output

['(pp. 81-95)', '41', '483-497']
The fourth bird
  • Thank you so much! It worked! I only had to edit it a little, since I found out that sometimes my hyphen is not really a hyphen, and to remove the part matching occurrences between two commas, since right now I only need to find numbers between a comma and a dot. This is how I edited the code: `r"\(pp\.\s+\d+(?:[—–-]\d+)?\)|\b\d+(?:[—–-]\d+)?(?=(?:\d+(?:[—–-]\d+)?)*\.)"` It seems to work fine, but I'd like to ask someone more expert whether I can use it this way. Thanks again! – Diana Mele Apr 21 '22 at 00:45
  • 2
    @DianaMele In that case you can shorten the pattern to `\(pp\.\s+\d+(?:[—–-]\d+)?\)|\b\d+(?:[—–-]\d+)?(?=\.)` See https://regex101.com/r/8Gy9Yg/1 – The fourth bird Apr 21 '22 at 05:41
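
For reference, a minimal sketch of the shortened pattern from the last comment, tried on shortened versions of the question's sample strings plus two made-up variants (an en dash and a single page). The names `PAGES`, `find_pages` and `samples` are illustrative only.

```python
import re

# Shortened pattern from the comment above: "(pp. N)" / "(pp. N-M)",
# or a bare number/range that is immediately followed by a dot.
# The class [—–-] accepts an em dash, an en dash, or a plain hyphen.
PAGES = re.compile(r"\(pp\.\s+\d+(?:[—–-]\d+)?\)|\b\d+(?:[—–-]\d+)?(?=\.)")

def find_pages(text):
    """Return every page reference found in text, in order."""
    return PAGES.findall(text)

samples = [
    "Mendeley Journal, 67(2), (pp. 81-95).",
    "Journal of Learning Disabilities, 41, 483-497.",
    "Journal of Learning Disabilities, 41, 483–497.",  # en dash (made-up variant)
    "Some Journal, 12, 113.",                          # single page (made-up)
]
for s in samples:
    print(find_pages(s))
# ['(pp. 81-95)']
# ['483-497']
# ['483–497']
# ['113']
```

One caveat: the lookahead only requires a dot after the number, so any number directly followed by a dot elsewhere in the text (for example the integer part of a decimal) would presumably match as well.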