import re
DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'
regx = re.compile(r'\s*(?:,|and)\s*')
print regx.split(DATA)
Result:
['Joe', 'Dave', 'Professional', 'Ph.D.', 'Someone else']
Where's the difficulty?
Note that with the non-capturing group (?:,|and)
the delimiters don't appear in the result, while with the capturing group (,|and)
the result would be
['Joe', ',', 'Dave', ',', 'Professional', ',', 'Ph.D.', 'and', 'Someone else']
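To make the capturing versus non-capturing difference concrete, here is a minimal sketch in Python 3 syntax. The variable names non_capturing and capturing are mine, for illustration only. It relies on the documented behaviour of re.split: text matched by capturing groups in the pattern is included in the result list.

```python
import re

DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'

# Non-capturing group: the delimiters are discarded.
non_capturing = re.compile(r'\s*(?:,|and)\s*')
# Capturing group: each matched delimiter is kept as its own list item.
capturing = re.compile(r'\s*(,|and)\s*')

without_delims = non_capturing.split(DATA)
with_delims = capturing.split(DATA)

print(without_delims)
print(with_delims)
```

Only the group type changes between the two patterns; the surrounding \s* still strips the whitespace around each delimiter in both cases.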
Edit 1
Err... the difficulty is that with
DATA = 'Joe, Dave, Professional, Handicaped, Ph.D. and Someone else'
the result is
['Joe', 'Dave', 'Professional', 'H', 'icaped', 'Ph.D.', 'Someone else']
Corrected:
regx = re.compile(r'\s+and\s+|\s*,\s*')
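A quick check of the corrected pattern, as a sketch in Python 3 syntax (the name parts is just for illustration). It suggests why the fix works: 'and' is now treated as a delimiter only when it has whitespace on both sides.

```python
import re

DATA = 'Joe, Dave, Professional, Handicaped, Ph.D. and Someone else'

# 'and' only counts as a delimiter when surrounded by whitespace,
# so the 'and' inside 'Handicaped' no longer splits the word.
regx = re.compile(r'\s+and\s+|\s*,\s*')
parts = regx.split(DATA)
print(parts)
```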
Edit 2
Err... ahem... ahem...
Sorry, I hadn't noticed that 'Professional, Ph.D.' must not be split. But what is the criterion for not splitting at a given comma in this string?
I chose this criterion: "split at a comma only if it is not followed by text containing dots before the next comma".
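This criterion can be expressed as a negative lookahead placed right after the comma. The sketch below isolates that idea in a simplified pattern, in Python 3 syntax. The name comma_rule and the sample strings are illustrative assumptions, and the pattern is only the comma part, not the complete regex.

```python
import re

# Split at a comma unless the text up to the next comma contains a dot.
comma_rule = re.compile(r'\s*,(?!(?:[^,.]+?\.)+[^,.]*?,)\s*')

sample = 'Joe, Professional, Ph.D., Handicapped'
kept_together = comma_rule.split(sample)
print(kept_together)

# Limitation of the criterion as stated: it needs a *next* comma, so a
# dotted title at the very end of the string is still split off.
at_end = comma_rule.split('Joe, Professional, Ph.D.')
print(at_end)
```

The second call shows that the rule depends on a following comma being present, which is worth keeping in mind when the dotted title is the last item.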
Another problem arises when whitespace and the word 'and' are mixed.
There is also the problem of leading and trailing whitespace.
Finally, I managed to write a regex pattern that handles many more cases than the previous one, even if some of them are somewhat artificial (such as a stray 'and' at the end of the string; and why not at the beginning too?):
import re
regx = re.compile(r'\s*,(?!(?:[^,.]+?\.)+[^,.]*?,)(?:\sand[,\s]+|\s)*|\s*and[,\s]+|[.\s]*\Z|\A\s*')
DATA = ' Joe ,and, Dave , Professional, Ph.D., Handicapped and handyman , and Someone else and . .'
print repr(DATA)
print
print regx.split(DATA)
Result:
' Joe ,and, Dave , Professional, Ph.D., Handicapped and handyman , and Someone else and . .'
['', 'Joe', '', 'Dave', 'Professional, Ph.D.', 'Handicapped', 'handyman', 'Someone else', '', '']
With print [x for x in regx.split(DATA) if x], we obtain:
['Joe', 'Dave', 'Professional, Ph.D.', 'Handicapped', 'handyman', 'Someone else']
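The filtering step can be wrapped in a small helper. This is only a sketch in Python 3 syntax: the function name split_names is hypothetical, and the pattern is the one given above, written as a raw string. Note that from Python 3.7 on, re.split also splits on empty matches, so the number of empty strings in the unfiltered list can differ from the Python 2 output shown above; filtering them out sidesteps that difference.

```python
import re

# Same pattern as above, as a raw string split over two lines.
regx = re.compile(r'\s*,(?!(?:[^,.]+?\.)+[^,.]*?,)(?:\sand[,\s]+|\s)*'
                  r'|\s*and[,\s]+|[.\s]*\Z|\A\s*')

def split_names(text):
    """Split text with regx and drop the empty strings it leaves behind."""
    return [part for part in regx.split(text) if part]

DATA = (' Joe ,and, Dave , Professional, Ph.D., Handicapped and handyman , '
        'and Someone else and . .')
names = split_names(DATA)
print(names)
```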
Compare this with the result of Qtax's regex on the same string:
[' Joe ', 'and', 'Dave ', 'Professional, Ph.D.', 'Handicapped', 'handyman ', 'and Someone else', '. .']