I have a list of multi-row strings. I want to match the first row of each string, but only if it starts with one or more digits that are NOT immediately followed by a period.

For example, a list might be

list = ["42. blabla \n foo", "42 blabla \n foo", "422. blabla \n foo"]

and my desired output would be 42 blabla.

This code

import re 

list = ["42. blabla \n foo", "42 blabla \n foo", "422. blabla \n foo"]

regex_header = re.compile(r"^[0-9]+(?!\.).*\n")

for str in list:
    print(re.findall(regex_header, str))

outputs

['42. blabla \n']
['42 blabla \n']
['422. blabla \n']

This one works only when there are exactly two digits at the beginning of the string:

import re 

list = ["42. blabla \n foo", "42 blabla \n foo", "422. blabla \n foo"]

regex_header = re.compile(r"^[0-9]{2}(?!\.).*\n")

for str in list:
    print(re.findall(regex_header, str))

Output:

[]
['42 blabla \n']
['422. blabla \n']
Sal

2 Answers

You need a (?![.\d]) lookahead:

r"^\d+(?![.\d])"

Details:

  • ^ - start of string
  • \d+ - 1+ digits
  • (?![.\d]) - a negative lookahead that fails the match if a dot or a digit appears immediately to the right of the current location.
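
To see why the lookahead needs `\d` as well as the dot, here is a minimal sketch comparing it with the question's `(?!\.)`. The names `naive`, `fixed` and `matched_text` are made up for this illustration. With `(?!\.)` alone, `\d+` can backtrack and give up a digit, so the lookahead sees a digit instead of the dot and succeeds.

```python
import re

# The question's lookahead: only a dot is forbidden after the digits.
naive = re.compile(r"^\d+(?!\.)")
# The lookahead from this answer: neither a dot nor a digit may follow.
fixed = re.compile(r"^\d+(?![.\d])")

samples = ["42. blabla", "42 blabla", "422. blabla"]

def matched_text(pattern, text):
    """Return the matched prefix, or None if the pattern does not match."""
    m = pattern.match(text)
    return m.group() if m else None

naive_results = [matched_text(naive, s) for s in samples]
fixed_results = [matched_text(fixed, s) for s in samples]

print(naive_results)  # ['4', '42', '42']
print(fixed_results)  # [None, '42', None]
```

In the first sample the naive pattern matches just `4`: after backtracking, the next character is `2`, not a dot. Forbidding digits in the lookahead closes that escape route.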

In Python:

import re 
l = ["42. blabla \n foo", "42 blabla \n foo", "422. blabla \n foo"]
regex_header = re.compile(r"^[0-9]+(?![.\d])")
for s in l:
    if regex_header.search(s):
        print(s)
# => "42 blabla \n foo"
Wiktor Stribiżew
  • Just wondering re the `.` inside the `[ ]` -- do we not need to escape it here to `\.`? I don't quite understand why `\d` works as expected in the brackets, while things seem to be different for the `.`? Thanks! – patrick Jul 05 '19 at 16:55
  • @patrick Inside a character class, only `\`, `-`, `^` and `]` should be escaped. The rest is treated as literal chars. `[.]` = `r'\.'`. See [What special characters must be escaped in regular expressions?](https://stackoverflow.com/a/400316/3832970) – Wiktor Stribiżew Jul 05 '19 at 16:58
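
A small sketch of the point made in this comment, using Python's `re` module. The variable names are illustrative only.

```python
import re

text = "a.b"

in_class = re.findall(r"[.]", text)  # a dot inside [] is a literal dot
escaped = re.findall(r"\.", text)    # an escaped dot is also a literal dot
bare = re.findall(r".", text)        # a bare dot matches any character except newline

# Shorthand classes such as \d keep their meaning inside [].
dot_or_digit = re.findall(r"[.\d]", "4.2x")

print(in_class)      # ['.']
print(escaped)       # ['.']
print(bare)          # ['a', '.', 'b']
print(dot_or_digit)  # ['4', '.', '2']
```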
My guess is that this might be what we want:

import re 

list = ["42. blabla \n foo", "42 blabla \n foo", "422. blabla \n foo"]

regex_header = re.compile(r"^[0-9]+(?!\.)\D*$")

for str in list:
    print(re.findall(regex_header, str))

Emma
  • Perfect -- this works as desired, and replacing `$` with `\n` will return only the first row. – Sal Jul 05 '19 at 16:57
  • EDIT: this works as desired, unless the line contains digits further away from the beginning (e.g. if we replace the second string in list with `"42 blabla 00 \n foo"`). – Sal Jul 05 '19 at 17:14
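
One possible way around the limitation in the last comment, offered as a sketch that is not taken from either answer: combine the `(?![.\d])` lookahead with `.*`, which stops at the end of the first row because `.` does not match `\n` by default. The names `first_row`, `rows` and `headers` are made up for this example.

```python
import re

# Digits at the start, not followed by a dot or another digit,
# then the rest of the first row (".*" does not cross the newline).
first_row = re.compile(r"^\d+(?![.\d]).*")

rows = ["42. blabla \n foo", "42 blabla 00 \n foo", "422. blabla \n foo"]

# Keep the matched first row for every string that qualifies.
headers = [m.group() for m in map(first_row.match, rows) if m]

print(headers)  # ['42 blabla 00 ']
```

Because `\D*` is replaced by `.*`, digits later in the row no longer break the match, and the trailing newline is left out of the result.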