1

I have been struggling with my regex query for over 3 hours now. I want to strip punctuation marks from a string and keep only the words, as shown below.

String: "th1s 1s a numer1c2w0rld. Say 'He11o W0r1d!'"

If I use the regular expression [^"'\s-]\w*[\d]?\w*[^-!'"\s], it leaves out the one-letter word a. Everything else is okay.

Here's another attempt [^"'\s-]\w*[_]{0,2}?\w*[^-!'."\s], but once again, one-letter words are ignored by regex. Please note that _ is optional, there could be max two underscores. Hence, I added the code for [_]{0,2}?
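
A quick way to see why the one-letter word disappears is to run the first attempt in Python. This is a minimal sketch, and the variable names are illustrative rather than part of the original post. The leading and trailing character classes each have to consume a character of their own, so no match can be shorter than two characters.

```python
import re

text = "th1s 1s a numer1c2w0rld. Say 'He11o W0r1d!'"

# First attempt from the question, unchanged.
pattern = re.compile(r'''[^"'\s-]\w*[\d]?\w*[^-!'"\s]''')

matches = pattern.findall(text)

# The first class and the last class each need one character,
# so every match is at least two characters long.
shortest = min(len(m) for m in matches)

print(matches)
print("shortest match length:", shortest)
print("'a' matched on its own:", "a" in matches)
```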

Can someone please help me? Thanks for your help.

I researched this topic on SO and found that most threads, e.g. "Regular expression to match a whole word or one letter", deal with plain contiguous words. My words are password-like, meaning they can contain digits inside them, e.g. th1s or even numer1cw0rld.


The desired output is the following list of words:

th1s 
1s 
a 
numer1c2__w0rld
numeric_world
Say
He11o
W0r1d

Additional clarification: spaces are NOT allowed in the word. That's why I added \s in my regex.

Additional clarification: The words cannot end or start with _. However, "abcd_efgh" is valid.

wovano
watchtower
  • So spaces are allowed in a "word" if it's quoted by single quotes? – blhsing Sep 28 '18 at 05:08
  • No, spaces are not allowed – watchtower Sep 28 '18 at 05:09
  • @blhsing: Thanks...I checked regex101, but I don't see it. I could be mistaken? – watchtower Sep 28 '18 at 05:11
  • Not sure I am following your question (one of the items in your expected output contains a space, but then you say that spaces are not allowed). Are you trying to match everything but single quotes, double quotes, and spaces? Something like `/[^'"\s]+/` – benvc Sep 28 '18 at 05:12
  • Yes, benvc. That's right. I am using regex to strip off all punctuation and spaces. `_`s must remain. – watchtower Sep 28 '18 at 05:13
  • And what would be the expected output for words with more than 2 underscores, e.g. `abc___xyz`? – blhsing Sep 28 '18 at 05:17
  • @blhsing - For words with two underscores, we could keep them. Underscores could also be dashes. I don't know how to handle dashes. – watchtower Sep 28 '18 at 05:18
  • So `abc___xyz` would become two words `abc` and `xyz`? Can a word begin or end with underscores? So is `_abc` considered a word? Can a word contain multiple occurrences of two or less underscores, e.g. is `abc__xyz__123` one word? – blhsing Sep 28 '18 at 05:19
  • @blhsing - No, a word cannot begin or end with an underscore or a dash. So, `abd_` or `_abd` are invalid. However, `abcd_def` is valid. – watchtower Sep 28 '18 at 05:22
  • What should a_b_c_d produce? – ysth Sep 30 '18 at 08:16
  • @ysth: `a_b_c_d`. Please see answer from ikegami. It also focuses on efficiency. – watchtower Sep 30 '18 at 16:41

5 Answers

4

If you were ok with leading and trailing _, you'd simply use the following:

\w+

If you wanted no _ at all, you'd simply use the following:

[^\W_]+    # Like \w, but doesn't match "_"

So, to allow _ inside a word but not at its start or end, you could use the following:

[^\W_] \w* [^\W_] | [^\W_]

We can factor out [^\W_].

[^\W_] (?: \w* [^\W_] )?
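
As a rough Python check of the factored form (the sample string and names below are illustrative assumptions, not part of the answer), a lone letter now matches because the whole tail is optional, and leading or trailing underscores are left out of the match:

```python
import re

# Factored pattern from above, with the spaces removed.
word = re.compile(r"[^\W_](?:\w*[^\W_])?")

# Hypothetical sample: a one-letter word, words with a leading or
# trailing underscore, and a word with an inner underscore.
sample = "a _abc abd_ abcd_efgh 1s"

found = word.findall(sample)
print(found)
```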

That said, it's more efficient to view what you want to match as a bunch of "words" separated by underscores (e.g. word, word_word, word_word_word, etc) because it reduces backtracking on failed matches. So, we get the following:

[^\W_]+ (?: _+ [^\W_]+ )*          # Or  [^\W_]+ (?: _{1,2} [^\W_]+ )*

(Remove the spaces or use /x.)
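
In Python, the counterpart of /x is the re.VERBOSE flag. Below is a minimal sketch of the final pattern, assuming the at-most-two-underscores variant is the one wanted. The name WORD and the extra test words appended to the question's string are made up for illustration.

```python
import re

WORD = re.compile(
    r"""
    [^\W_]+                  # one run of letters and digits
    (?: _{1,2} [^\W_]+ )*    # then, repeatedly: one or two underscores plus another run
    """,
    re.VERBOSE,
)

text = "th1s 1s a numer1c2__w0rld. Say 'He11o W0r1d!' _abc abd_ abcd_efgh"

words = WORD.findall(text)
print(words)
```

With this variant, a run of three or more underscores acts as a separator, so `abc___xyz` is split into `abc` and `xyz`.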

ikegami
  • I think `[^\W_] \w+ [^\W_] | [^\W_]` should use an `*` rather than the `+` like `[^\W_] \w* [^\W_] | [^\W_]` so that it will match OP's example output of `1s` or the easier to read `8s` (since the 1 looks a bit like a lowercase L). – benvc Sep 28 '18 at 13:35
  • @benvc, I started with the final snippet, and tried to show a bit how I got there. But I guess I made a mistake! Fixed. – ikegami Sep 28 '18 at 17:05
  • Many thanks for the step by step. I did not have a real grasp on non-capturing groups until I worked through your answer. – benvc Sep 28 '18 at 17:09
  • @benvc, Oh that's easy: `(?:...)` is the same as `(...)`, just without the expense of capturing. – ikegami Sep 28 '18 at 17:11
  • Allows more than 2 consecutive _ – ysth Sep 30 '18 at 21:10
  • @ysth, 1) The OP doesn't say it shouldn't, 2) I actually provided a version that only allows at most 2. – ikegami Sep 30 '18 at 21:13
  • "there could be max two underscores" could be a parsing limitation or just a statement about the input data, but either way it wouldn't hurt to check – ysth Sep 30 '18 at 21:23
  • @ysth, Which is why my answer specially shows how to check – ikegami Sep 30 '18 at 21:24
1

This should work as you expect:

([^-_"'\s][-]?\w*[^-_!'."\s]|[a-z]+)
yogesh10
1

Maybe a simple negated set listing all the punctuation / space characters that are not allowed would work (handles one letter "words" just fine other than excluded characters).

For example, matches one or more of any character except exclamation marks, single quotes, double quotes, periods, or spaces (so it does allow hyphens, underscores, etc in addition to alphanumerics):

[^'"\!\.\s]+

EDIT (for additional requirement that words can't start or end with underscores or hyphens):

This one matches one or more of any character except exclamation marks, single quotes, double quotes, periods, or spaces (so it does allow hyphens, underscores, etc in addition to alphanumerics), but excludes matches that start or end with underscores or hyphens (uses pipe operator for an alternate expression to handle single character matches).

[^_'"\!\-\.\s][^'"\!\.\s]*[^_'"\!\-\.\s]|[^_'"\!\-\.\s]

Also, to avoid confusion for any future readers, the question as posted makes no mention of hyphens (that requirement is only noted in the comments), so here is some regex that assumes that matches should not include hyphens.

[^_'"\!\-\.\s][^'"\!\-\.\s]*[^_'"\!\-\.\s]|[^_'"\!\-\.\s]

That said, see the much more elegant answer from @ikegami that also prevents matching on other non-word characters such as commas, parentheses, etc.
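
To make that difference concrete, here is a small Python comparison. It is a sketch only, and the input string is invented for illustration. The negated set keeps any character it does not explicitly exclude, while the \w-based pattern keeps only letters, digits, and inner underscores.

```python
import re

# Negated-set pattern from this answer (the version that also excludes hyphens).
negated = re.compile(r'''[^_'"\!\-\.\s][^'"\!\-\.\s]*[^_'"\!\-\.\s]|[^_'"\!\-\.\s]''')

# Pattern from ikegami's answer, with the spaces removed.
wordlike = re.compile(r"[^\W_]+(?:_+[^\W_]+)*")

sample = "one,two (three)"

negated_result = negated.findall(sample)    # commas and parentheses survive
wordlike_result = wordlike.findall(sample)  # only word characters are kept

print(negated_result)
print(wordlike_result)
```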

benvc
0

If you want to cut out punctuation, a simpler way might be this:

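import string

punc = string.punctuation  # all ASCII punctuation; note that this includes "_"
a = "th1s 1s a numer1c2w0rld. Say 'He11o W0r1d!'"
# Drop every punctuation character, then split the rest on whitespace.
a_mod = "".join(x for x in a if x not in punc).split()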
Sven Harris
  • I don't want to use `join`. Sorry. This is a regex question. – watchtower Sep 28 '18 at 05:12
  • Ok no worries, maybe specify that at the bottom of your question to stop you getting more pure python answers. For regex I find https://www.regexpal.com/ useful – Sven Harris Sep 28 '18 at 05:18
0
import re
a = "th1s 1s a numer1c2w0rld. S_ay 'He11o W0r1d!'"
# No capturing groups, so findall returns the matched strings rather than tuples.
re.findall(r'[a-zA-Z0-9](?:[a-zA-Z0-9]*[-_]{0,2}[a-zA-Z0-9]*)?', a)

Output

['th1s', '1s', 'a', 'numer1c2w0rld', 'S_ay', 'He11o', 'W0r1d']
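
One thing to watch for with re.findall: when a pattern contains capturing groups, it returns the groups rather than the whole match, as one tuple per match if there is more than one group. The sketch below compares a version of the pattern written with two capturing groups against one that uses a single non-capturing group. The variable names are illustrative.

```python
import re

a = "th1s 1s a numer1c2w0rld. S_ay 'He11o W0r1d!'"

# Two capturing groups: findall returns a (outer, inner) tuple per match.
with_groups = re.findall(
    r"([a-zA-Z0-9]([a-zA-Z0-9]*[-_]{0,2}[a-zA-Z0-9]*)?)", a
)

# Non-capturing group only: findall returns the matched strings themselves.
without_groups = re.findall(
    r"[a-zA-Z0-9](?:[a-zA-Z0-9]*[-_]{0,2}[a-zA-Z0-9]*)?", a
)

print(with_groups[0])
print(without_groups)
```

If the capturing groups are needed for some other reason, taking the first element of each tuple gives the full matches.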
Raunaq Jain