
I have a small regex, (\d\.){2,}, to split a book into chapters. A chapter heading is recognized as a single digit followed by a dot, with this combination occurring at least twice. It should split only at chapter headings, not at single numbered list items. Here's an example:

3.2.4.2. porta pellentesque   
139. Nunc maximus maximus aliquet? 
 a) dignissim 
 b) volutpat  
 c) ullamcorper  

3.2.4.3. ligula at condimentum fringilla  
152. Sed dapibus nulla mi, id lobortis ligula bibendum vehicula?  
 a) vestibulum   
 b) pellentesque   
 c) tempus   
 d) rutrum   
 
153. Lorem ipsum dolor sit amet. Sed iaculis lacus pellentesque, non auctor eros lobortis?  
 a) suscipit   
 b) vulputate   
 c) vestibulum   
 d) congue   
 
3.2.5. elementum quis  

It should be split at 3.2.4.2., 3.2.4.3. and 3.2.5. The regex builder recognizes the correct match, but it always adds an unwanted group match at the end, and I can't get rid of it. The result looks like this (one bullet is one split):

  • 3.2.4.
  • 2.
  • ...

  • 3.2.4.
  • 3.
  • ...

  • 3.2.
  • 5.
  • ...

I want three splits, not nine. I tried greedy and lazy quantifiers and various ways of grouping, but unfortunately I couldn't get it right. It may be worth mentioning that the whole thing should run in a Python project. For a better understanding, here is the link to the regex builder I used.

Andy_Lima

1 Answer


Your capturing group contains only one instance of the digit-and-dot pair, and the quantifier repeats that group, so the group ends up holding only the last repetition. If you want all repetitions in one group, the quantifier has to go inside the group. Since you probably also want to discard the inner group that carries the quantifier, you can use ?: to make it non-capturing.

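import re

# A raw string avoids invalid-escape warnings for \d and \.
# The inner (?:...) group is non-capturing; the outer group captures the whole run.
r = re.compile(r"((?:\d\.){2,})")
s = """3.2.4.2. porta pellentesque
139. Nunc maximus maximus aliquet?
...
"""

r.findall(s) # with the full sample text from the question: ['3.2.4.2.', '3.2.4.3.', '3.2.5.']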
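
To illustrate why the original pattern returned only the last digit, here is a minimal sketch (the variable names are illustrative and not from the original post). It assumes a single heading line from the question and compares the repeated capturing group with its non-capturing counterpart:

```python
import re

line = "3.2.4.2. porta pellentesque"

# Repeated capturing group: group 1 is overwritten on every repetition,
# so only the last "digit + dot" pair survives in it.
m = re.search(r"(\d\.){2,}", line)
whole_match = m.group(0)   # the full run of digit-dot pairs
last_repeat = m.group(1)   # only the final repetition

# re.findall returns the group's value when the pattern has exactly one
# capturing group, and the whole match when it has none.
with_group = re.findall(r"(\d\.){2,}", line)
without_group = re.findall(r"(?:\d\.){2,}", line)

print(whole_match, last_repeat, with_group, without_group)
# 3.2.4.2. 2. ['2.'] ['3.2.4.2.']
```

The whole match is correct in both cases; only the value reported for the group differs, which is what findall (and a regex builder's group listing) shows.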

As mentioned in the comments to the original post and in the related question, this can also be solved by using findall without any capturing group, which is probably the better solution to this question.

re.findall(r"(?:\d\.){2,}", s) # ['3.2.4.2.', '3.2.4.3.', '3.2.5.']
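
Since the question asks for splits rather than a list of headings, the following is one possible sketch of the splitting step, not necessarily what the asker ended up using. It assumes that every chapter heading starts at the beginning of a line and that Python 3.7 or newer is available (older versions do not split on empty matches). The names text, heading_start and sections are made up for this example, and the sample is a shortened copy of the text from the question:

```python
import re

text = """3.2.4.2. porta pellentesque
139. Nunc maximus maximus aliquet?
 a) dignissim

3.2.4.3. ligula at condimentum fringilla
152. Sed dapibus nulla mi, id lobortis ligula bibendum vehicula?
 a) vestibulum

3.2.5. elementum quis
"""

# Zero-width split point: the start of any line that begins with at least
# two "digit + dot" pairs. The lookahead consumes nothing, so the chapter
# number stays attached to its own section.
heading_start = re.compile(r"^(?=(?:\d\.){2,})", re.MULTILINE)

# Drop the empty piece in front of the first heading and trim whitespace.
sections = [part.strip() for part in heading_start.split(text) if part.strip()]

for section in sections:
    print(repr(section.splitlines()[0]))
```

Lines such as "139." or "152." are not split points, because their first digit is followed by another digit rather than a dot. Because the pattern contains no capturing group, re.split does not add extra group values to the result.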
Simon S.
  • You don't need re.MULTILINE as there are no anchors in the pattern. – The fourth bird Jul 31 '23 at 15:26
  • @The fourth bird What are the similarities between this question and "re.findall behaves weird"? That both use Python's re library and both are about regex, or what else? – Andy_Lima Jul 31 '23 at 15:45
  • @Andy_Lima it is a known duplicate that re.findall returns the capture group values. You also don't need the capture group from this answer but just `(?:\d\.){2,}` – The fourth bird Jul 31 '23 at 16:00
  • @The fourth bird Thanks! I updated the answer and also added the solution that uses no capturing groups at all, which seems like the correct approach. – Simon S. Aug 02 '23 at 19:17