Given the following text:

text = "Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character. In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2] She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5] It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"

I need:

["Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.",
 "In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2]",
 "She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5]",
 "It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"]

I tried this but it doesn't work:

new_line = re.split('(?<=\.) |(([.?!](\[\d+\])+))\s', text)
print(new_line)

The result I am getting is this:

['Van der Weyden was preoccupied by commissioned\xa0portraiture\xa0towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.', None, None, None, "In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers", '.[2]', '.[2]', '[2]', 'She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress', '.[3][4][5]', '.[3][4][5]', '[5]', "It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"]
kpie
Aiha
    You need to either escape special characters or use the `r` prefix -- the latter is the common practice. – trincot Jan 18 '22 at 21:45

2 Answers

You can use

re.findall(r'(?s)(.*?(?:\.|[.?!](?:\[\d+\])+))(?:\s+|\s*\Z)', text)

Details:

  • (?s) - same as re.S or re.DOTALL, makes . match across lines
  • (.*?(?:\.|[.?!](?:\[\d+\])+)) - Group 1:
    • .*? - any zero or more characters, as few as possible
    • (?:\.|[.?!](?:\[\d+\])+) - either a dot alone, or a ., ? or ! followed by one or more occurrences of a [ + digit(s) + ] substring
  • (?:\s+|\s*\Z) - either one or more whitespace characters, or zero or more whitespace characters followed by the end of the string.

See the Python demo:

import re
text = "Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character. In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2] She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5] It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"
print( re.findall(r'(.*?(?:\.|[.?!](?:\[\d+\])+))(?:\s+|\s*\Z)', text, re.DOTALL) )

Output:

[
  'Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.',
  "In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2]",
  'She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5]',
  "It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"
]
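
For reuse, the same pattern can be compiled once and annotated inline. The following is only a sketch: the names SENTENCE_RE and split_sentences are hypothetical and do not appear in the answer, and re.VERBOSE is used purely so the pattern can carry comments.

```python
import re

# Hypothetical names; the pattern itself is the one from the answer above,
# spread over several lines with re.VERBOSE so each part can be commented.
SENTENCE_RE = re.compile(r"""
    (                       # group 1: one sentence
        .*?                 # as few characters as possible
        (?:
            \.              # a plain dot
          |
            [.?!]           # or one of . ? !
            (?:\[\d+\])+    # followed by one or more bracketed numbers
        )
    )
    (?:\s+|\s*\Z)           # whitespace, or optional whitespace at the end
""", re.DOTALL | re.VERBOSE)


def split_sentences(text):
    """Return the sentences matched by SENTENCE_RE as a list of strings."""
    return SENTENCE_RE.findall(text)


print(split_sentences("First one. Second one.[2] Third one![3][4]"))
# ['First one.', 'Second one.[2]', 'Third one![3][4]']
```

Note that, as written, a ? or ! ends a sentence only when bracketed references follow it, so an input such as "Hello! World." comes back as a single item.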
Wiktor Stribiżew

You need to use non-capturing groups ((?:...)) or re.split will include the captured parts in the output:

import re
# text is the string defined in the question
new_line = re.split(r'(?<=\.) |(?:[.?!](?:\[\d+\])+)\s', text)
print(new_line)
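
To see the difference on a small input, here is a minimal sketch. The sample string and the variable names are made up for illustration, and the pattern is reduced to the part that handles bracketed references.

```python
import re

# Made-up sample, not the text from the question.
sample = "One.[1] Two![2][3] Three."

# Capturing group: re.split returns the captured separator between the pieces.
with_capture = re.split(r'([.?!](?:\[\d+\])+)\s', sample)

# Non-capturing group: the separator is consumed and dropped from the result.
without_capture = re.split(r'(?:[.?!](?:\[\d+\])+)\s', sample)

print(with_capture)     # ['One', '.[1]', 'Two', '![2][3]', 'Three.']
print(without_capture)  # ['One', 'Two', 'Three.']
```

In both cases the punctuation and the bracketed numbers are detached from the sentence they belong to, so when they must stay attached a matching approach such as re.findall may be a better fit than re.split.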
mozway
  • Hi Mozway, the output you get is not what OP expects. You just can't use `re.split` with the pattern you suggest because it removes the numbers inside square brackets at the end of the "sentences". Non-capturing groups are not lookarounds, they *consume* text. `re.split` removes consumed text. – Wiktor Stribiżew Jan 20 '22 at 08:55
  • @Wiktor thanks for the feedback. I have to say I mostly focused my answer on the general issue that OP was using a capturing group and thus captured the split, I left their regex untouched otherwise. – mozway Jan 20 '22 at 08:58
  • The main issue is actually stated in the title of the question. The group was used in an attempt "to keep the separator". As a lookbehind in `re` cannot contain a pattern of non-fixed length, OP used the group, but I am sure they also tried the lookbehind and it did not work. The real issue is also clear from the expected output. – Wiktor Stribiżew Jan 20 '22 at 09:01