
I need a RegEx pattern that will return the first N words using a custom word boundary that is the normal RegEx white space (\s) plus punctuation like .,;:!?-*_

EDIT #1: Thanks for all your comments.

To be clear:

  1. I'd like to set the characters that would be the word delimiters
  2. Let's call this the "Delimiter Set", or strDelimiters
  3. strDelimiters = ".,;:!?-*_"
  4. nNumWordsToFind = 5
  5. A word is defined as any contiguous text that does NOT contain any character in strDelimiters
  6. The RegEx word boundary is any contiguous text that contains one or more of the characters in strDelimiters
  7. I'd like to build the RegEx pattern to get/return the first nNumWordsToFind using the strDelimiters.

EDIT #2: Sat, Aug 8, 2015 at 12:49 AM US CT

@maraca definitely answered my question as originally stated. But what I actually need is to return the number of words ≤ nNumWordsToFind. So if the source text has only 3 words, but my RegEx asks for 4 words, I need it to return the 3 words. The answer provided by maraca fails if nNumWordsToFind > number of actual words in the source text.

For example:

one,two;three-four_five.six:seven eight    nine! ten

It would see this as 10 words. If I want the first 5 words, it would return:

one,two;three-four_five.

I have this pattern using the normal \s whitespace, which works, but NOT exactly what I need:

([\w]+\s+){<NumWordsOut>}

where <NumWordsOut> is the number of words to return.
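To illustrate why the whitespace-only pattern falls short on the sample text, here is a minimal sketch. It assumes Python's `re` module purely for demonstration (the actual target is Keyboard Maestro, whose regex engine may differ in details), and the variable names are hypothetical.

```python
import re

# Sample text from the question.
text = "one,two;three-four_five.six:seven eight    nine! ten"

# The whitespace-only pattern from the question, with <NumWordsOut> = 2.
pattern = re.compile(r"([\w]+\s+){2}")

# Only words followed by whitespace qualify, so the punctuation-separated
# words at the start of the text are skipped entirely.
match = pattern.search(text)
whitespace_only_result = match.group(0) if match else None

print(repr(whitespace_only_result))
```

The match starts at "seven" rather than "one", because "one" is followed by a comma instead of whitespace.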

I have also found this word boundary pattern, but I don't know how to use it:

a "real word boundary" that detects the edge between an ASCII letter and a non-letter.

(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])

However, I would want my words to allow numbers as well.

IAC, I have not been able to work out how to use the above custom word boundary pattern to return the first N words of my text.

BTW, I will be using this in a Keyboard Maestro macro.

Can anyone help? TIA.

JMichaelTX
  • This would all depend on the regex language. – Anonymous Aug 08 '15 at 01:33
  • Also, when you say *"plus punctuation like `.,;:!?-*_`"*, do you mean *exactly* those characters or similar characters? If the latter, you should specify *exactly* which characters you intend to use as separators. – Anonymous Aug 08 '15 at 01:35
  • You should also define exactly which characters qualify as word characters. Basically, be as specific as possible. – Anonymous Aug 08 '15 at 01:38
  • Thanks for all your comments and suggestions. I have updated my original post to provide the specificity you requested. – JMichaelTX Aug 08 '15 at 02:18
  • Much better, but it would be helpful if you could also specify what regular expression language you're using. – Anonymous Aug 08 '15 at 03:04

2 Answers

All you have to do is adapt your pattern ([\w]+\s+){<NumWordsOut>} to the following, which also handles some special cases:

^[\s.,;:!?*_-]*([^\s.,;:!?*_-]+([\s.,;:!?*_-]+|$)){<NumWordsOut>}
1.             2.              3.             4.  5.
  1. Match any amount of delimiters before the first word
  2. Match a word (= at least one non-delimiter)
  3. The word has to be followed by at least one delimiter
  4. Or it can be at the end of the string (in case no delimiter follows at the end)
  5. Repeat 2. to 4. <NumWordsOut> times

Note that I moved the - to the end of the character class: inside the brackets it has to be first or last, otherwise it needs to be escaped as \-.
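As a quick check of the pattern above, here is a minimal sketch in Python. Python's `re` module is an assumption made only for illustration (the question targets Keyboard Maestro), and `DELIMS` and `first_n_words` are hypothetical names.

```python
import re

# Delimiter set: whitespace plus the punctuation from the question.
# The hyphen is placed last so it is literal inside the character class.
DELIMS = r"\s.,;:!?*_-"

def first_n_words(text: str, n: int) -> str:
    """Return the leading part of text containing exactly n words, or ""."""
    # Optional leading delimiters, then n repetitions of
    # "word followed by delimiters or end of string".
    pattern = rf"^[{DELIMS}]*([^{DELIMS}]+([{DELIMS}]+|$)){{{n}}}"
    match = re.search(pattern, text)
    return match.group(0) if match else ""

sample = "one,two;three-four_five.six:seven eight    nine! ten"
print(first_n_words(sample, 5))
```

Because the quantifier is exact, asking for more words than the text contains produces no match at all, so this sketch returns an empty string in that case.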

maraca
  • GREAT answer! This definitely answers my question. May I ask a follow-on question? How can I return the number of words ≤ NumWordsOut ? If my source text has only 3 words, but my RegEx asks for 4, then it fails and returns nothing. I want it to return however many words it finds up to but no greater than NumWordsOut. How can I do this? – JMichaelTX Aug 08 '15 at 05:41
  • @JMichaelTX you can use `{0,<NumWordsOut>}` to restrict only the upper bound and also accept fewer words. In some regex flavors `{,<NumWordsOut>}` works too. – maraca Aug 08 '15 at 10:51
  • Thanks @maraca! That works like a charm! Perfect! Problem completely resolved! – JMichaelTX Aug 08 '15 at 21:11
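
The upper-bound-only quantifier discussed in the comments above can be sketched as follows. This again assumes Python's `re` module for illustration only, and `DELIMS` and `up_to_n_words` are hypothetical names rather than anything from the original answer.

```python
import re

# Delimiter set: whitespace plus the punctuation from the question,
# with the hyphen last so it is literal inside the character class.
DELIMS = r"\s.,;:!?*_-"

def up_to_n_words(text: str, n: int) -> str:
    """Return the leading part of text containing at most n words."""
    # {0,n} caps the repetition at n but also accepts fewer words.
    pattern = rf"^[{DELIMS}]*([^{DELIMS}]+([{DELIMS}]+|$)){{0,{n}}}"
    match = re.search(pattern, text)
    return match.group(0) if match else ""

print(up_to_n_words("one,two;three", 4))
```

With a three-word input and a limit of four, the greedy quantifier simply stops after the third word instead of failing.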

Thanks to @maraca for providing the complete answer to my question.

I just wanted to post the Keyboard Maestro macro that I have built using @maraca's RegEx pattern for anyone interested in the complete solution.

See KM Forum Macro: Get a Max of N Words in String Using RegEx

JMichaelTX