0

I want to split a string by removing everything except alphabetical characters.

By default, split only splits on whitespace between words. But I want to split on everything except alphabetical characters. How can I pass multiple delimiters to split?

For example:

word1 = input().lower().split()
# if you input " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."
# the result will be ['has', '15', 'science@and^engineering--departments,', 'affiliated', 'centers,', 'bandar', 'abbas&&and', 'mahshahr.']

But I am looking for this kind of result:

['has', '15', 'science', 'and', 'engineering', 'departments', 'affiliated', 'centers', 'bandar', 'abbas', 'and', 'mahshahr']
jpp
  • Also https://stackoverflow.com/questions/1059559/split-strings-with-multiple-delimiters – OneCricketeer Jul 15 '18 at 14:38
  • You could do `import re` and `words = re.findall(r"\w+", input().lower())`. – trincot Jul 15 '18 at 14:40
  • @jonrsharpe, I think this is a different question. I believe OP is trying to split by all alphanumerical characters. Not split by selected characters only. There may be another dup but I couldn't find it. – jpp Jul 15 '18 at 14:41
  • @jpp, if problem is to *split* on alphanumeric, wouldn't there be non-alphanumeric characters in the result? It seems that splitting on multiple delimiters is a duplicate regardless of which set of delimiters are used for the split - the only difference in a regex solution would be the pattern used. – wwii Jul 15 '18 at 14:58
  • @wwii, See my answer, seems to solve the problem without being an answer to the proposed duplicate. Although everyone seems to prefer regex. Possibly the question needs more clarity, but then it's unclear / too broad rather than a dup. – jpp Jul 15 '18 at 15:00
  • @jpp - I saw that and was happily surprised - that's why I limited my comment to regex solutions. – wwii Jul 15 '18 at 15:03

2 Answers

5

For performance, you should use regex as per the marked duplicate. See benchmarking below.

groupby + str.isalnum

You can use itertools.groupby with str.isalnum to group consecutive characters by whether they are alphanumeric, keeping only the alphanumeric runs.

With this solution you do not have to worry about splitting by explicitly specified characters.

from itertools import groupby

x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."

res = [''.join(j) for i, j in groupby(x, key=str.isalnum) if i]

print(res)

['has', '15', 'science', 'and', 'engineering', 'departments',
 'affiliated', 'centers', 'Bandar', 'Abbas', 'and', 'Mahshahr']
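
To match the lower-cased output the question asks for, the same idea can be wrapped in a small helper. This is only a minimal sketch: the name `split_alnum` and its `key` parameter are hypothetical and not part of the original answer. Note that `str.isalnum` keeps digits, so `'15'` survives; passing `str.isalpha` instead drops it, if strictly alphabetical words are wanted.

```python
from itertools import groupby

def split_alnum(text, key=str.isalnum):
    """Return the runs of characters for which key(char) is true, lower-cased.

    Hypothetical helper for illustration; `key` decides what counts as
    part of a word (str.isalnum keeps digits, str.isalpha does not).
    """
    return [''.join(group).lower()
            for is_word, group in groupby(text, key=key)
            if is_word]

x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."

words = split_alnum(x)                         # letters and digits
letters_only = split_alnum(x, key=str.isalpha) # letters only, '15' is dropped

print(words)
print(letters_only)
```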

Benchmarking vs regex

Some performance benchmarking versus regex solutions (tested on Python 3.6.5):

from itertools import groupby
import re

x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."

z = x*10000
%timeit [''.join(j) for i, j in groupby(z, key=str.isalnum) if i]  # 184 ms
%timeit list(filter(None, re.sub(r'\W+', ',', z).split(',')))      # 82.1 ms
%timeit list(filter(None, re.split(r'\W+', z)))                    # 63.6 ms
%timeit [_ for _ in re.split(r'\W', z) if _]                       # 62.9 ms
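
`%timeit` is an IPython magic, so the lines above will not run in a plain Python script. A rough equivalent using the standard `timeit` module might look like the sketch below. The `candidates` dictionary, its labels, and the smaller input (`x * 100` rather than `x * 10000`) are assumptions made here to keep the run short; absolute timings will differ by machine and Python version.

```python
from itertools import groupby
import re
import timeit

x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."
z = x * 100  # assumption: smaller than the original x*10000 so the script finishes quickly

# The four approaches from the benchmark, wrapped as zero-argument callables.
candidates = {
    "groupby": lambda: [''.join(j) for i, j in groupby(z, key=str.isalnum) if i],
    "sub + split": lambda: list(filter(None, re.sub(r'\W+', ',', z).split(','))),
    "split on \\W+": lambda: list(filter(None, re.split(r'\W+', z))),
    "split on \\W": lambda: [w for w in re.split(r'\W', z) if w],
}

# Run each approach once so the outputs can be compared.
results = {name: func() for name, func in candidates.items()}

runs = 20
for name, func in candidates.items():
    seconds = timeit.timeit(func, number=runs)
    print(f"{name}: {seconds / runs * 1e3:.3f} ms per loop")
```

On this input all four approaches return the same list. They can differ on other inputs: `\W` treats the underscore as a word character, while `str.isalnum` does not.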
jpp
2

You can replace each run of non-alphanumeric characters with a single character (I'm using a comma):

import re

s = 'has15science@and^engineering--departments,affiliatedcenters,bandarabbas&&andmahshahr.'

alphanumeric = re.sub(r'\W+', ',', s)

and then split it on comma:

splitted = alphanumeric.split(',')

Edit:

As suggested by @DeepSpace, this can be done in a single statement:

splitted = re.split(r'\W+', s)
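
One thing to watch for: because `s` ends with a delimiter (the final `.`), `re.split` returns a trailing empty string, and the same happens with the `re.sub` plus `split(',')` version. The sketch below shows two possible ways around that: filtering out empty strings, or matching the words directly with `re.findall`, as trincot suggested in the comments. The variable names `filtered` and `found` are illustrative only.

```python
import re

s = 'has15science@and^engineering--departments,affiliatedcenters,bandarabbas&&andmahshahr.'

# Splitting leaves an empty string where the input starts or ends with a delimiter.
splitted = re.split(r'\W+', s)

# Option 1: drop the empty strings after splitting.
filtered = [word for word in splitted if word]

# Option 2: match runs of word characters instead of splitting on the rest.
found = re.findall(r'\w+', s)

print(splitted)
print(filtered)
print(found)
```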
Ankit Jaiswal