python regex to split on certain patterns with skip patterns

Question

I want to split a Python string on certain patterns but not others. For example, I have the string

Joe, Dave, Professional, Ph.D. and Someone else

I want to split on `\sand\s` and on `,`, but not on `, Ph.D.`

How can this be accomplished in Python regex?

Could you clarify a little bit what exactly it is you're trying to accomplish at the end of things here? Are you trying to get it to return a split-up string with the things you mentioned as delimiters or what? I'm sorry, it's a little unclear to me. From what I'm gleaning from it, though, you might be able to do this with Python's str.split() method. — ajcl, Jul 29 '11 at 03:45
I'm looking for an re or str.split() method to split the above into ['Joe', 'Dave', 'Professional, Ph.D.', 'Someone else'] — chriskirk, Jul 29 '11 at 03:53
check this out http://stackoverflow.com/questions/1059559/python-strings-split-with-multiple-separators — Hassek, Jul 29 '11 at 04:31

score 3 · Accepted Answer · answered Jul 29 '11 at 06:30

You can use:

re.split(r'\s+and\s+|,(?!\s*Ph\.D\.)\s*', 'Joe, Dave, Professional, Ph.D. and Someone else')

Result:

['Joe', 'Dave', 'Professional, Ph.D.', 'Someone else']
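In case the one-liner is hard to read, here is the same pattern spelled out with re.VERBOSE. This is only a commented sketch of the regex above, not a different approach; the names SPLIT_RE, text and parts are mine.

```python
import re

# Verbose rewrite of the pattern above; whitespace and comments inside the
# pattern are ignored because of re.VERBOSE.
SPLIT_RE = re.compile(r"""
      \s+and\s+          # the word "and" with whitespace on both sides
    | ,                  # or a comma ...
      (?!\s*Ph\.D\.)     # ... unless the next thing after it is "Ph.D."
      \s*                # plus any whitespace following the comma
""", re.VERBOSE)

text = 'Joe, Dave, Professional, Ph.D. and Someone else'
parts = SPLIT_RE.split(text)
print(parts)
```

The negative lookahead `(?!...)` only checks what follows the comma and consumes nothing, which is why `, Ph.D.` stays attached to the previous item.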

Qtax

This gives 'Professional, Ph.D.' as one element, whereas they are actually two: Professional and Ph.D. – Gopal Chitalia Aug 22 '18 at 17:03
@GopalChitalia if you read the question you will see that this answer was exactly what was asked for: *"I want to split on `\sand\s` and `,`, but not `, Ph.D.`"* – Qtax Aug 23 '18 at 11:31
Umm, okay! My bad, sorry for that. Can you tell me a regex that takes all of that into consideration, though? I have a very large string and I want to tokenize it into single words. I can't split it on spaces alone, since it contains many patterns. Example (just to show the kinds of patterns that exist): `''Anarchism'' , U.S.A , from_the_Encyclopaedia , [[Hierarchy|hierarchical]], , {{Redirect|Anarchists}} , year=2005, 0-7546-6196-2,http://web.archive.org` Basically a very large string with every kind of pattern we can think of, and I want to tokenize them. Can you help? – Gopal Chitalia Aug 23 '18 at 12:13
@GopalChitalia it's not clear to me exactly what you want here. Split on `,`? Just make a new question for this with all the details. – Qtax Aug 23 '18 at 12:40

score 0 · Answer 2 · answered Jul 29 '11 at 05:02

You can do this with either regular expressions or plain string operations (e.g. str.split()).

Here's an example that uses only plain string operations:

>>> DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'
>>> IGNORE_THESE = frozenset([',', 'and'])

>>> PRUNED_DATA = [d.strip(',') for d in DATA.split(' ') if d not in IGNORE_THESE]
>>> print(PRUNED_DATA)
['Joe', 'Dave', 'Professional', 'Ph.D.', 'Someone', 'else']

I'm sure there is some complex regular expression you could use, but this seems very straightforward to me and quite maintainable.

I hope you're not trying to parse natural language. For that, I would use a dedicated library such as NLTK.
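For what it's worth, a string-only version can still produce the list the question asks for if the protected suffix is glued back on after splitting. This is a rough sketch under my own assumptions: the function name split_names and the keep_with_previous parameter are made up for illustration, and it only recognises suffixes that match exactly.

```python
def split_names(text, keep_with_previous=('Ph.D.',)):
    """Split on ',' and ' and ', then glue protected suffixes back on."""
    # Treat ' and ' as just another comma so one split handles both.
    pieces = [p.strip() for p in text.replace(' and ', ', ').split(',')]
    merged = []
    for piece in pieces:
        if piece in keep_with_previous and merged:
            # Re-attach e.g. 'Ph.D.' to the item it belongs to.
            merged[-1] = merged[-1] + ', ' + piece
        elif piece:
            # Skip empty pieces left behind by doubled commas.
            merged.append(piece)
    return merged


result = split_names('Joe, Dave, Professional, Ph.D. and Someone else')
print(result)
```

Unlike splitting on single spaces, this keeps 'Someone else' together because only commas and ' and ' act as delimiters.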

**'Someone'** and **'else'** are split, while the OP wants to obtain **'Someone else'** — eyquem, Jul 29 '11 at 06:44

Answer 3 · eyquem · answered Jul 29 '11 at 09:14

import re

DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'

# Split on a comma or the word 'and', swallowing surrounding whitespace.
regx = re.compile(r'\s*(?:,|and)\s*')

print(regx.split(DATA))

result

['Joe', 'Dave', 'Professional', 'Ph.D.', 'Someone else']

Where's the difficulty?

Note that with the non-capturing group (?:,|and) the delimiters don't appear in the result, while with the capturing group (,|and) the result would be

['Joe', ',', 'Dave', ',', 'Professional', ',', 'Ph.D.', 'and', 'Someone else']
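A small side-by-side sketch of that difference, on a shorter string of my own so the two outputs are easy to compare (the variable names are just for illustration):

```python
import re

text = 'Joe, Dave and Sue'

# Non-capturing group: the delimiters are dropped from the result.
without_delims = re.split(r'\s*(?:,|and)\s*', text)

# Capturing group: re.split also returns whatever the group matched.
with_delims = re.split(r'\s*(,|and)\s*', text)

print(without_delims)
print(with_delims)
```

The capturing form can be handy when you need to know which delimiter separated two items.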

Edit 1

errrr.... the difficulty is that with

DATA = 'Joe, Dave, Professional, Handicaped, Ph.D. and Someone else'

the result is

['Joe', 'Dave', 'Professional', 'H', 'icaped', 'Ph.D.', 'Someone else']

Corrected:

regx = re.compile(r'\s+and\s+|\s*,\s*')
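To make the problem and the fix concrete, here is a quick comparison. The first two patterns are the ones from this answer; the third, using `\b` word boundaries, is an alternative I'm adding myself and is not part of the original answer.

```python
import re

text = 'Professional, Handicaped, Ph.D. and Someone else'

# First attempt: 'and' matches anywhere, even inside 'Handicaped'.
naive = re.split(r'\s*(?:,|and)\s*', text)

# Corrected pattern: 'and' must have whitespace on both sides.
spaced = re.split(r'\s+and\s+|\s*,\s*', text)

# Alternative (my assumption): require word boundaries around 'and'.
bounded = re.split(r'\s*(?:,|\band\b)\s*', text)

print(naive)
print(spaced)
print(bounded)
```

Both fixes leave 'Handicaped' in one piece; they differ only in inputs where 'and' touches punctuation directly.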

Edit 2

errrr.. ahem... ahem...

Sorry, I hadn't noticed that Professional, Ph.D. must not be split. But what is the criterion for not splitting at a given comma in this string?

I chose this criterion: "a comma not followed by a string having dots before the next comma"
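To see what that criterion does on its own, here is the comma part of the pattern isolated from the rest. This is only a sketch for illustration (COMMA_RE and the two test strings are mine). As far as I can trace it, the lookahead only protects a comma when another comma follows the dotted text, so the original string from the question would still be split at `, Ph.D.`.

```python
import re

# Only the comma part of the pattern, isolated for illustration.
COMMA_RE = re.compile(r',(?!(?:[^,.]+?\.)+[^,.]*?,)\s*')

# 'Ph.D.' is followed by another comma, so the comma before it is protected.
kept = COMMA_RE.split('Dave, Professional, Ph.D., Handicapped')

# No comma after 'Ph.D.', so the lookahead does not apply and we split.
split_anyway = COMMA_RE.split('Professional, Ph.D. and Someone else')

print(kept)
print(split_anyway)
```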

Another problem arises when whitespace and the word 'and' are mixed.

And there's also the problem of leading and trailing whitespace.

In the end I managed to write a regex pattern that handles many more cases than the previous one, even if some of them are somewhat artificial (like a stray 'and' at the end of the string; why not at the beginning too? etc.):

import re

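regx = re.compile(r'\s*,(?!(?:[^,.]+?\.)+[^,.]*?,)(?:\sand[,\s]+|\s)*|\s*and[,\s]+|[.\s]*\Z|\A\s*')

DATA = '   Joe ,and, Dave , Professional, Ph.D., Handicapped   and handyman  , and   Someone else  and  . .'
print(repr(DATA))
print('')
# The output shown below is from Python 2. Python 3.7+ also splits on empty
# matches, so it may add an extra '' at the end of the list.
print(regx.split(DATA))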

result

'   Joe ,and, Dave , Professional, Ph.D., Handicapped   and handyman  , and   Someone else  and  . .'

['', 'Joe', '', 'Dave', 'Professional, Ph.D.', 'Handicapped', 'handyman', 'Someone else', '', '']

With print([x for x in regx.split(DATA) if x]), we obtain:

['Joe', 'Dave', 'Professional, Ph.D.', 'Handicapped', 'handyman', 'Someone else']
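For readers on Python 3, a self-contained sketch of the same filtering step. The pattern and the test string are copied from above; the name parts is mine. As far as I know, re.split has also split on empty matches since Python 3.7, so the unfiltered list can contain a different number of empty strings than shown above, which is one more reason to filter them out.

```python
import re

regx = re.compile(
    r'\s*,(?!(?:[^,.]+?\.)+[^,.]*?,)(?:\sand[,\s]+|\s)*'
    r'|\s*and[,\s]+|[.\s]*\Z|\A\s*'
)

DATA = ('   Joe ,and, Dave , Professional, Ph.D., Handicapped   and handyman'
        '  , and   Someone else  and  . .')

# Drop the empty strings produced by adjacent or edge delimiters.
parts = [x for x in regx.split(DATA) if x]
print(parts)
```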

Compare this with the result of Qtax's regex on the same string:

['   Joe ', 'and', 'Dave ', 'Professional, Ph.D.', 'Handicapped', 'handyman  ', 'and   Someone else', '. .']
