I understand that to use the split() method from Python's re module, I need to provide the pattern at which the string should be broken (e.g. to split at whitespace, I would use something like pattern = re.compile('\\s+')).

But for more complex cases, where I have a string with a pattern which looks like this:

'letters<space>letters<space>numbers<space>...repeat...'

how should I write the regex to split at every repetition? I tried to use the negation of the expression that matches the string exactly up to the repetition, as suggested here, but Python throws an error. Any suggestions?

ad3angel1s
  • Could you be more concise? What's your expected output? – Shubham Sharma Mar 10 '20 at 18:15
  • getting all the substrings that look like: 'letterslettersnumbers', 'letterslettersnumbers', etc. – ad3angel1s Mar 10 '20 at 18:22
  • "An error": I am sure Python says much more than that. Add the code that produced the error and the resulting traceback. – Jongware Mar 10 '20 at 18:29
  • Plus it still isn't too clear what you ask. Do you need any kind of repetition? Then it isn't really a `re.split`, since you don't have any meaningful delimiter. – petre Mar 10 '20 at 18:31
  • _I tried to use the negate of the expression that matches the string exactly until the repetition, like is suggested here, but Python throws an error._ **Please provide the entire error message, as well as a [mcve].** Keep in mind that Stack Overflow is not a substitute for guides, tutorials, or documentation. – AMC Mar 10 '20 at 19:53

2 Answers

Given the example string:

import re
text = 'aaaaa 12345 aaaaa bbbbb 12345 bbbbb ccccc 12345 ccccc'

instead of re.split(), you could use re.findall():

re.findall(r'\w+\s+\w+\s+\w+', text)
# output: ['aaaaa 12345 aaaaa', 'bbbbb 12345 bbbbb', 'ccccc 12345 ccccc']

If you want to use re.split() anyway, you can wrap the pattern in a capturing group, so the matched text is kept in the result, and then use a list comprehension to drop the empty and whitespace-only pieces:

parts = re.split(r'(\w+\s+\w+\s+\w+)', text)
# output: ['', 'aaaaa 12345 aaaaa', ' ', 'bbbbb 12345 bbbbb', ' ', 'ccccc 12345 ccccc', '']

[ele for ele in parts if ele.strip()]
# output: ['aaaaa 12345 aaaaa', 'bbbbb 12345 bbbbb', 'ccccc 12345 ccccc']
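
If the records really follow the letters, letters, numbers order described in the question, re.split() can also cut directly at the whitespace between a trailing number and the next run of letters by using zero-width lookarounds, which avoids the filtering step. This is only a sketch: the sample string is hypothetical (it is not from the question), and it assumes digits appear only at the end of each record.

```python
import re

# Hypothetical sample in the question's "letters letters numbers" layout.
text = 'abc def 123 ghi jkl 456 mno pqr 789'

# Split on whitespace that is preceded by a digit and followed by a letter.
# The lookbehind and lookahead consume nothing, so only the whitespace is removed.
records = re.split(r'(?<=\d)\s+(?=[A-Za-z])', text)
print(records)
# output: ['abc def 123', 'ghi jkl 456', 'mno pqr 789']
```

The whitespace inside each record is left alone because it is never both preceded by a digit and followed by a letter.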
soloidx

If I understood the question correctly, this could be one way to split the string:

In [298]: s
Out[298]: 'lettersone letterstwo 12 lettersthree lettersfour 34'

In [299]: re.findall(r'(?:\w+ \w+ \d+)', s)
Out[299]: ['lettersone letterstwo 12', 'lettersthree lettersfour 34']
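
If the individual fields of each repetition are needed as well, the same idea might be extended with named groups and re.finditer(). The sketch below reuses the string s from above. The group names first, second and number are made up for illustration, and the stricter [A-Za-z]+ classes assume the "letters" parts contain ASCII letters only.

```python
import re

s = 'lettersone letterstwo 12 lettersthree lettersfour 34'

# Hypothetical group names; [A-Za-z]+ is stricter than \w+, which also matches digits.
record = re.compile(r'(?P<first>[A-Za-z]+) (?P<second>[A-Za-z]+) (?P<number>\d+)')

# One dict per repetition, keyed by group name.
fields = [m.groupdict() for m in record.finditer(s)]
print(fields)
# output: [{'first': 'lettersone', 'second': 'letterstwo', 'number': '12'},
#          {'first': 'lettersthree', 'second': 'lettersfour', 'number': '34'}]
```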
petre