How to split a string with many delimiters in Python?

Question

I want to split a string by removing everything except alphabetical characters.

By default, split only splits on whitespace between words, but I want to split on everything except alphabetical characters. How can I pass multiple delimiters to split?

For example:

word1 = input().lower().split()
# If you input " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."
# the result will be ['has', '15', 'science@and^engineering--departments,', 'affiliated', 'centers,', 'bandar', 'abbas&&and', 'mahshahr.']

But I am looking for this kind of result:

['has', '15', 'science', 'and', 'engineering', 'departments', 'affiliated', 'centers', 'bandar', 'abbas', 'and', 'mahshahr']

Also https://stackoverflow.com/questions/1059559/split-strings-with-multiple-delimiters — OneCricketeer, Jul 15 '18 at 14:38
You could do `import re` and `words = re.findall(r"\w+", input().lower())`. — trincot, Jul 15 '18 at 14:40
@jonrsharpe, I think this is a different question. I believe OP is trying to split by all alphanumerical characters. Not split by selected characters only. There may be another dup but I couldn't find it. — jpp, Jul 15 '18 at 14:41
@jpp, if problem is to *split* on alphanumeric, wouldn't there be non-alphanumeric characters in the result? It seems that splitting on multiple delimiters is a duplicate regardless of which set of delimiters are used for the split - the only difference in a regex solution would be the pattern used. — wwii, Jul 15 '18 at 14:58
@wwii, See my answer, seems to solve the problem without being an answer to the proposed duplicate. Although everyone seems to prefer regex. Possibly the question needs more clarity, but then it's unclear / too broad rather than a dup. — jpp, Jul 15 '18 at 15:00
@jpp - I saw that and was happily surprised - that's why I limited my comment to regex solutions. — wwii, Jul 15 '18 at 15:03

jpp · Accepted Answer · 2018-07-15T15:27:55.527

For performance, you should use regex as per the marked duplicate. See benchmarking below.

groupby + str.isalnum

You can use itertools.groupby with str.isalnum to group by characters which are alphanumeric.

With this solution you do not have to worry about splitting by explicitly specified characters.

from itertools import groupby

x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."

res = [''.join(j) for i, j in groupby(x, key=str.isalnum) if i]

print(res)

['has', '15', 'science', 'and', 'engineering', 'departments',
 'affiliated', 'centers', 'Bandar', 'Abbas', 'and', 'Mahshahr']
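If digits should be dropped as well (the question mentions alphabetical characters, although its expected output keeps '15'), the same grouping idea might be adapted by using str.isalpha as the key. This is a minimal sketch under that assumption; the variable name letters_only is only illustrative, and the input is lowercased as in the question.

```python
from itertools import groupby

x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."

# Group consecutive characters by whether they are letters,
# then keep only the runs for which str.isalpha returned True.
letters_only = [
    ''.join(run)
    for is_alpha, run in groupby(x.lower(), key=str.isalpha)
    if is_alpha
]

print(letters_only)
# ['has', 'science', 'and', 'engineering', 'departments',
#  'affiliated', 'centers', 'bandar', 'abbas', 'and', 'mahshahr']
```

Because ' 15 ' contains no letters, it is treated as one delimiter run and disappears from the result.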

Benchmarking vs regex

Some performance benchmarking versus regex solutions (tested on Python 3.6.5):

from itertools import groupby
import re

x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."

z = x*10000
%timeit [''.join(j) for i, j in groupby(z, key=str.isalnum) if i]  # 184 ms
%timeit list(filter(None, re.sub(r'\W+', ',', z).split(',')))      # 82.1 ms
%timeit list(filter(None, re.split(r'\W+', z)))                    # 63.6 ms
%timeit [_ for _ in re.split(r'\W', z) if _]                       # 62.9 ms
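The %timeit lines above are IPython magics and will not run in a plain Python script. A rough equivalent using the standard-library timeit module could look like the sketch below. The helper function names, the smaller input size, and the repeat count are assumptions chosen for illustration, and the timings will differ by machine and Python version.

```python
import re
import timeit
from itertools import groupby

x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."
z = x * 1000  # smaller than the benchmark input above so the script finishes quickly


def with_groupby(s):
    # Keep only the runs of alphanumeric characters.
    return [''.join(j) for i, j in groupby(s, key=str.isalnum) if i]


def with_split(s):
    # Split on runs of non-word characters and drop empty strings.
    return [w for w in re.split(r'\W+', s) if w]


def with_findall(s):
    # Match runs of word characters directly.
    return re.findall(r'\w+', s)


# Sanity check: the three approaches agree on this ASCII input without underscores.
assert with_groupby(z) == with_split(z) == with_findall(z)

runs = 5
for func in (with_groupby, with_split, with_findall):
    seconds = timeit.timeit(lambda: func(z), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1000:.1f} ms per call")
```

Note that \w also matches the underscore while str.isalnum does not, so the approaches can differ on input that contains underscores.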

What if we also want to remove the numbers? — Sina Arzany, Jul 22 '18 at 10:58

Ankit Jaiswal · Answer 2 · 2018-07-15T14:47:31.867

You can replace each run of non-alphanumeric characters with a single character (I'm using a comma):

import re

s = 'has15science@and^engineering--departments,affiliatedcenters,bandarabbas&&andmahshahr.'

alphanumeric = re.sub(r'\W+', ',', s)

and then split it on the comma:

splitted = alphanumeric.split(',')

Edit:

As suggested by @DeepSpace, this can be done in a single statement:

splitted = re.split(r'\W+', s)
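One detail worth checking with re.split: when the string starts or ends with a delimiter (here the trailing '.'), the result contains an empty string at that edge. The sketch below shows this and two possible ways around it, either filtering the empty strings or matching the words directly with re.findall, as suggested in the comments on the question. The names parts, words and words_findall are illustrative.

```python
import re

s = 'has15science@and^engineering--departments,affiliatedcenters,bandarabbas&&andmahshahr.'

# re.split leaves an empty string where the input ends with a delimiter.
parts = re.split(r'\W+', s)
print(parts)
# ['has15science', 'and', 'engineering', 'departments',
#  'affiliatedcenters', 'bandarabbas', 'andmahshahr', '']

# Option 1: drop the empty strings after splitting.
words = [p for p in parts if p]

# Option 2: match runs of word characters instead of splitting on the gaps.
words_findall = re.findall(r'\w+', s)

print(words == words_findall)
# True
```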

edited Jul 15 '18 at 14:47

answered Jul 15 '18 at 14:43

Or simply use `re.split` – DeepSpace Jul 15 '18 at 14:47
@DeepSpace, Thanks, updated my answer :) – Ankit Jaiswal Jul 15 '18 at 14:50
