I have these two strings:

'''N678 N−12.,13.12.22,18.,19.,31.1.,9.2.,7. au 10.,13. au 17.,20.,21.,22.,28.
au 31.3.,6.,13.,19.,20.4.,5.,8.5.23'''

'''N678 à p. du 11.6.23 C+23.6.23 N−14.,24.7.23'''

I want to split them like this (shown here for the second string):

  ['N678 à p. du 11.6.23 ', 'C+23.6.23 ', 'N−14.,24.7.23']

The idea is to split every time this format appears: `any_character\s\d{1,2}.\d{1,2}.\d{2,4}\s[A-Z]`. I produced the example above with the regex `([A-Za-z].*?\d{1,2}\.\d{1,2}\.\d{2,4}\s*)`, but it does not work for the first string, because `N678 N−12...` gets split somewhere around 'au':

`['N678 N−12.,13.12.22', 'au 10.,13. au 17.,20.,21.,22.,28.\nau 31.3.,6.,13.,19.,20.4.,5.,8.5.23']`

It works for the first string if I delete the question mark that makes the quantifier lazy, but that gives an incorrect result for the second string.

Any ideas?

  • Try `re.split(r'\b([A-Za-z]\+\d{1,2}\.\d{1,2}\.\d{2,4}\s*)(?=[A-Z])', text)` – Wiktor Stribiżew Jul 19 '23 at 12:08
  • What's your expected result for the first string? – Zero Jul 19 '23 at 12:15
  • @WiktorStribiżew it did not give the expected output for other strings such as: N68 du 12.6. au 3.9.23 C+23.6.,15.8.23 N-18.,25.6.,2.,17. au 21.,23. au 28.,31.7.,3.,4.,7.,8.,9.,27.8.23 – Ferroum Samir Jul 19 '23 at 12:25
  • @Zero the first string should be captured whole, since it contains no other uppercase letter that is preceded by a date in that format. – Ferroum Samir Jul 19 '23 at 12:26
  • Two points here: first, you are matching both uppercase and lowercase letters at the start of a group; second, your group has a limited length, so it will only match that much. – Zero Jul 19 '23 at 12:28

3 Answers

0

You'd need to split on a look-ahead:

(?=[A-Z][+−-]?\d+\b)

https://regex101.com/r/hoyJ9q/1

Do note that the "dash" in "N−14" is not a standard dash.
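
Since the question is about Python, here is a minimal sketch of the same look-ahead split with `re.split`. It assumes the non-standard dash is U+2212 MINUS SIGN (written as `\u2212` so it stays visible) and Python 3.7 or later, where `re.split` accepts zero-width matches. The helper name `split_codes` is made up for illustration.

```python
import re

# Zero-width split point: just before an uppercase letter that is optionally
# followed by '+', U+2212 MINUS SIGN or ASCII '-', and then by digits.
SPLIT_BEFORE_CODE = re.compile(r"(?=[A-Z][+\u2212-]?\d+\b)")

def split_codes(text):
    """Split text before each code such as N678, C+23 or N−14 (hypothetical helper)."""
    # Unlike JavaScript, Python returns an empty first piece when the
    # pattern matches at position 0, so empty pieces are filtered out.
    return [part for part in SPLIT_BEFORE_CODE.split(text) if part]

second = "N678 à p. du 11.6.23 C+23.6.23 N\u221214.,24.7.23"
print(split_codes(second))
# ['N678 à p. du 11.6.23 ', 'C+23.6.23 ', 'N−14.,24.7.23']
```

As in the JS version, the first string from the question is still split between `N678` and `N−12`, because both look like a letter followed by digits.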


JS example for reference:

var regexp = /(?=[A-Z][+−-]?\d+\b)/g;

var str = 'N678 à p. du 11.6.23 C+23.6.23 N−14.,24.7.23';

console.log(str.split(regexp));

var str = 'N678 N−12.,13.12.22,18.,19.,31.1.,9.2.,7. au 10.,13. au 17.,20.,21.,22.,28. au 31.3.,6.,13.,19.,20.4.,5.,8.5.23';

console.log(str.split(regexp));
MonkeyZeus
0

I updated your regex to this:

([A-Z].*?\d{1,2}\.\d{1,2}\.\d{2,4}\s*[^A-Z]*)

It works at least for the given cases.

import re

# Raw string, so that \d and \. reach the regex engine unchanged
regex = r'([A-Z].*?\d{1,2}\.\d{1,2}\.\d{2,4}\s*[^A-Z]*)'

strings = [
    '''N678 N-12.,13.12.22,18.,19.,31.1.,9.2.,7. au 10.,13. au 17.,20.,21.,22.,28. au 31.3.,6.,13.,19.,20.4.,5.,8.5.23''',
    '''N678 à p. du 11.6.23 C+23.6.23 N-14.,24.7.23'''
]

for string in strings:
    print(re.findall(regex, string))

Output:

['N678 N-12.,13.12.22,18.,19.,31.1.,9.2.,7. au 10.,13. au 17.,20.,21.,22.,28. au 31.3.,6.,13.,19.,20.4.,5.,8.5.23']
['N678 à p. du 11.6.23 ', 'C+23.6.23 ', 'N-14.,24.7.23']
Zero
0

If you only want a match in the second string, you can match instead of splitting.

\b[A-Z].*?\d{1,2}\.\d{1,2}\.\d{2,4}(?=\s+[A-Z]|$)

The pattern matches:

  • \b A word boundary to prevent a partial word match
  • [A-Z] Match a single uppercase char A-Z (As all the matches you want seem to start with one)
  • .*? Match any character, as few as possible
  • \d{1,2}\.\d{1,2}\.\d{2,4} Match your "digits" pattern
  • (?=\s+[A-Z]|$) Positive lookahead, assert either 1+ whitespace chars followed by a char A-Z to the right, or assert the end of the string (note that \s can also match a newline)
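
The breakdown above can also be kept next to the pattern itself. The following is only a sketch of the same regex written with `re.VERBOSE`, so each part carries its own comment; the names `ANNOTATED` and `sample` are chosen here for illustration.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE
ANNOTATED = re.compile(
    r"""
    \b                          # word boundary, prevents a partial word match
    [A-Z]                       # a single uppercase char A-Z
    .*?                         # any character, as few as possible
    \d{1,2}\.\d{1,2}\.\d{2,4}   # the d.m.yy style "digits" pattern
    (?=\s+[A-Z]|$)              # whitespace plus A-Z ahead, or end of line
    """,
    re.VERBOSE | re.MULTILINE,
)

# U+2212 MINUS SIGN is written as an escape to keep it visible
sample = "N678 à p. du 11.6.23 C+23.6.23 N\u221214.,24.7.23"
pieces = ANNOTATED.findall(sample)
print(pieces)
# ['N678 à p. du 11.6.23', 'C+23.6.23', 'N−14.,24.7.23']
```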


import re

pattern = r"\b[A-Z].*?\d{1,2}\.\d{1,2}\.\d{2,4}(?=\s+[A-Z]|$)"

s = ("N678 à p. du 11.6.23 C+23.6.23 N−14.,24.7.23\n\n"
    "N678 N−12.,13.12.22,18.,19.,31.1.,9.2.,7. au 10.,13. au 17.,20.,21.,22.,28.\n"
    "au 31.3.,6.,13.,19.,20.4.,5.,8.5.23\n\n"
    "N68 du 12.6. au 3.9.23 C+23.6.,15.8.23 N-18.,25.6.,2.,17. au 21.,23. au 28.,31.7.,3.,4.,7.,8.,9.,27.8.23\n\n\n")

print(re.findall(pattern, s, re.M))

Output

[
  'N678 à p. du 11.6.23', 'C+23.6.23', 'N−14.,24.7.23', 'N68 du 12.6. au 3.9.23', 'C+23.6.,15.8.23',
  'N-18.,25.6.,2.,17. au 21.,23. au 28.,31.7.,3.,4.,7.,8.,9.,27.8.23'
]

Note that .*? can also run across text in the format \d{1,2}\.\d{1,2}\.\d{2,4} when that text is not directly followed by \s+[A-Z] or the end of the line.

If you don't want that, you could use a tempered greedy token:

\b[A-Z](?:(?!\d{1,2}\.\d{1,2}\.\d{2,4}\b).)*\d{1,2}\.\d{1,2}\.\d{2,4}(?=\s+[A-Z]|$)
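
To see where the two patterns differ, here is a small comparison sketch. The input string is made up for illustration (it is not from the question): its first code contains a complete date that is followed by a comma instead of whitespace and an uppercase letter.

```python
import re

DATE = r"\d{1,2}\.\d{1,2}\.\d{2,4}"

# Lazy version: .*? may run across a complete date on its way to a later one
LAZY = re.compile(r"\b[A-Z].*?" + DATE + r"(?=\s+[A-Z]|$)", re.M)

# Tempered version: the dot may not step onto the start of a complete date
TEMPERED = re.compile(
    r"\b[A-Z](?:(?!" + DATE + r"\b).)*" + DATE + r"(?=\s+[A-Z]|$)", re.M
)

# Hypothetical input: '1.2.23' is a full date followed by ','
made_up = "N1 1.2.23,3.4.23 C+5.6.23"

lazy_result = LAZY.findall(made_up)
tempered_result = TEMPERED.findall(made_up)
print(lazy_result)      # ['N1 1.2.23,3.4.23', 'C+5.6.23']
print(tempered_result)  # ['C+5.6.23']
```

The tempered version drops the first code altogether, because its first complete date is not followed by whitespace and an uppercase letter. Whether that is the wanted behaviour depends on the data.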


The fourth bird