I have an input that looks like the following.

word1-word2
word1 word2
word1+word2
--word1--word2-
word1-word2 
word1,word2,
(word1),word2

etc

I have to create a list that ends up containing ['word1', 'word2'] and nothing else. (It can contain empty strings, which I can remove later.) There can be any special characters around the two words. Is there a straightforward way to do this, perhaps with a better regex?

I am trying something along the following lines, based on this question:

Splitting a string with multiple delimiters in Python

re.split(r'[-+ ,]+', INPUT)

There isn't any consistency between special characters surrounding the two words.
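
To make the limitation concrete, here is a minimal sketch of the split-then-filter approach described above. The helper name `split_on_known_delimiters` is made up for illustration; the pattern is the one from the question. Any character that is not listed in the character class, such as a parenthesis, stays attached to the word.

```python
import re

def split_on_known_delimiters(text):
    """Split on the delimiters listed in the question, then drop empty strings.

    Hypothetical helper for illustration only.
    """
    parts = re.split(r'[-+ ,]+', text)
    return [p for p in parts if p]  # remove the blanks left at either end

print(re.split(r'[-+ ,]+', "--word1--word2-"))      # ['', 'word1', 'word2', '']
print(split_on_known_delimiters("--word1--word2-"))  # ['word1', 'word2']
print(split_on_known_delimiters("(word1),word2"))    # ['(word1)', 'word2']
```

The last line shows the problem: the parentheses are not in the delimiter list, so they survive.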

Ank
  • Why don't you just strip the empty strings from the result. – simonzack Sep 04 '14 at 18:37
  • I have mentioned that in the question: "It can have blanks that I can remove later I guess". – Ank Sep 04 '14 at 18:38
  • @simonzack Did I get downvoted for this?? – Ank Sep 04 '14 at 18:38
  • So is the code you have now not working? Or not straightforward enough? Or what? – Kevin Sep 04 '14 at 18:41
  • @Ank Yes I thought it was a rather trivial question, if you want to do this using a single `split` you should've said so in the question. – simonzack Sep 04 '14 at 18:42
  • I was looking for a way in which I don't have to add all the special characters I can think of. Can I tell python (exclude every special character that you see surrounding the words). – Ank Sep 04 '14 at 18:43
  • By the definition of splitting, your fourth example has to be split into `['', 'word1', 'word2', '']`. So the only way I can see around it is (a) don't use `split` (e.g., reverse your test and use `findall`), or (b) use `split` and then post-process to strip any "extra" empties off the ends. – abarnert Sep 04 '14 at 18:49
  • yes.. My code does strip the extra empties in post-process. – Ank Sep 04 '14 at 18:51
  • @Ank: So to clarify, your problem was "how do I match all 'special characters' in a regular expression"? If so, edit the question to make that clearer, and also define what you mean by "special characters", because it's hard to guess whether you mean the exact same thing as `\W` or something different… – abarnert Sep 04 '14 at 18:52

1 Answer

Sounds like what you're really trying to do is extract words from a string that may contain special characters. So just look for words then:

re.findall(r'\w+', text)

>>> re.findall(r'\w+', "word1,word2,")
['word1', 'word2']
>>> re.findall(r'\w+', "(word1),word2")
['word1', 'word2']
>>> re.findall(r'\w+', "--word1--word2-")
['word1', 'word2']

re.findall will create a list of regex matches.

\w is regex shorthand for word characters: letters, digits, and the underscore. For ASCII input it is equivalent to [a-zA-Z0-9_]; in Python 3, a str pattern also matches Unicode letters and digits unless you pass re.ASCII. So a caveat with this solution is that if you have something like word1_word2, you'll get ['word1_word2'].

If this is not desired, then go with the following regex: [a-zA-Z0-9]+
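
As a rough sketch of both variants, the snippet below runs the two patterns over the inputs from the question. The helper `extract_words` and its `allow_underscore` parameter are names invented here for illustration, not part of the `re` module.

```python
import re

def extract_words(text, allow_underscore=True):
    """Return the runs of word characters in text (hypothetical helper).

    With allow_underscore=False, an underscore is treated as a separator.
    """
    pattern = r'\w+' if allow_underscore else r'[a-zA-Z0-9]+'
    return re.findall(pattern, text)

samples = ["word1-word2", "word1 word2", "word1+word2", "--word1--word2-",
           "word1-word2 ", "word1,word2,", "(word1),word2"]
for s in samples:
    print(repr(s), extract_words(s))  # each prints ['word1', 'word2']

# Splitting on non-word characters instead leaves empty strings at the ends.
print(re.split(r'\W+', "--word1--word2-"))  # ['', 'word1', 'word2', '']

# The underscore caveat:
print(extract_words("word1_word2"))                          # ['word1_word2']
print(extract_words("word1_word2", allow_underscore=False))  # ['word1', 'word2']
```

Because findall matches the words themselves rather than the gaps between them, leading or trailing separators never produce empty entries, so no post-processing is needed.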

Manny D
  • Why the extra brackets? – simonzack Sep 04 '14 at 18:47
  • This would be better with an explanation: why using `\w` is better than "trying to name every special character I can think of", and why using `findall` to find words instead of `split` to split on non-words solves the problems with the extra blank values before and/or after the useful ones. – abarnert Sep 04 '14 at 18:50
  • Ah, because my initial answer was something like `[a-zA-Z0-9]`. Just forgot to remove them. – Manny D Sep 04 '14 at 18:50