How do I match text with a regular expression ignoring punctuation and line breaks

Question

I have an app where I need to find the position of a list of words in a passage of text. A regex is clearly the way to do this, but the issue is that there may be all kinds of punctuation or new lines between the words. How do I "find these words, possibly separated by some non-alphanumeric characters"?

UPDATE:

An example would be that I need to find the range of:

shouted help these regular expressions are horrible so

in

The developer shouted "help", these regular expressions are horrible! So, please help me :(

Could you give us a text example? — Wilmer, Jul 07 '16 at 21:38

score -1 · Answer 1 · edited Jul 07 '16 at 23:02

Description

\b(?:[a-z](?:[a-z\n\r.:;,?!-]*[a-z])?)\b


This regular expression (used with the case-insensitive flag, since the sample matches include `How` and `I`) will do the following:

Requires all words to start and end with a-z, or be a single letter long
Allows words to contain new line characters, or common punctuation like .:;,?!-
Words are not allowed to contain spaces
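The pattern can be tried from Swift roughly as follows. This is a minimal sketch, not part of the original answer: the helper name `matches(of:in:)` and the shortened sample string are made up for illustration, and the case-insensitive option is an assumption inferred from the sample matches. It uses a raw string literal, which needs Swift 5 or later.

```swift
import Foundation

// The pattern from the answer, as a raw string so backslashes are not doubled.
let wordPattern = #"\b(?:[a-z](?:[a-z\n\r.:;,?!-]*[a-z])?)\b"#

/// Hypothetical helper: returns every match of `pattern` in `text` as a string.
func matches(of pattern: String, in text: String) -> [String] {
    // Assumption: case-insensitive matching, because the pattern itself only lists a-z.
    guard let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) else {
        return []
    }
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, options: [], range: whole).compactMap { result in
        // Convert the NSRange back to a Swift range before slicing the string.
        Range(result.range, in: text).map { String(text[$0]) }
    }
}

// A shortened version of the sample text, with a word wrapped across a line break.
let sample = "How do I match text with a regular expres\nsion ignoring punctuation"
print(matches(of: wordPattern, in: sample))
```

Each element of the result is one "word", and a word that wraps across a line break comes back as a single match with the line break still inside it.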

Example

Live Demo

https://regex101.com/r/bK4oO8/1

Sample text

How do I match text with a regular expres
sion ignoring punctuation and line breaks?
How do I do "find these words pos-
sibly separated but some non-alphanumeric characters"?

Sample Matches

MATCH 1
0.  [0-3]   `How`

MATCH 2
0.  [4-6]   `do`

MATCH 3
0.  [7-8]   `I`

MATCH 4
0.  [9-14]  `match`

MATCH 5
0.  [15-19] `text`

MATCH 6
0.  [20-24] `with`

MATCH 7
0.  [25-26] `a`

MATCH 8
0.  [27-34] `regular`

MATCH 9
0.  [35-46] `expres
sion`

MATCH 10
0.  [47-55] `ignoring`

MATCH 11
0.  [56-67] `punctuation`

MATCH 12
0.  [68-71] `and`

MATCH 13
0.  [72-76] `line`

MATCH 14
0.  [77-88] `breaks?
How`

MATCH 15
0.  [89-91] `do`

MATCH 16
0.  [92-93] `I`

MATCH 17
0.  [94-96] `do`

MATCH 18
0.  [98-102]    `find`

MATCH 19
0.  [103-108]   `these`

MATCH 20
0.  [109-114]   `words`

MATCH 21
0.  [115-125]   `pos-
sibly`

MATCH 22
0.  [126-135]   `separated`

MATCH 23
0.  [136-139]   `but`

MATCH 24
0.  [140-144]   `some`

MATCH 25
0.  [145-161]   `non-alphanumeric`

MATCH 26
0.  [162-172]   `characters`

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    [a-z]                    any character of: 'a' to 'z'
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      [a-z\n\r.:;,?!-          any character of: 'a' to 'z', '\n'
      ]*                       (newline), '\r' (carriage return),
                               '.', ':', ';', ',', '?', '!', '-' (0
                               or more times (matching the most
                               amount possible))
----------------------------------------------------------------------
      [a-z]                    any character of: 'a' to 'z'
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------

Extra Credit

If you also want to eliminate matches like #14 above, where a ? is followed by a new line character, consider the expression below. In that configuration the ? should not be considered part of the word, whereas a - followed by a new line really is a hyphen.

\b(?:[a-z](?:(?:[a-z-]+|[.:;,?!-]+(?![\n\r])|[\n\r]+)*[a-z])?)\b

Live Demo: https://regex101.com/r/bK4oO8/2
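To see the difference between the two expressions, they can be run side by side on a short string. This is a sketch under assumptions: the helper `allMatches(of:in:)` and the test string are hypothetical, and case-insensitive matching is assumed as in the demo.

```swift
import Foundation

// Both patterns from the answer, as raw strings (Swift 5 or later).
let originalPattern = #"\b(?:[a-z](?:[a-z\n\r.:;,?!-]*[a-z])?)\b"#
let stricterPattern = #"\b(?:[a-z](?:(?:[a-z-]+|[.:;,?!-]+(?![\n\r])|[\n\r]+)*[a-z])?)\b"#

/// Hypothetical helper: returns every match of `pattern` in `text` as a string.
func allMatches(of pattern: String, in text: String) -> [String] {
    // Assumption: case-insensitive matching, as in the live demo.
    guard let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) else {
        return []
    }
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, options: [], range: whole).compactMap { result in
        Range(result.range, in: text).map { String(text[$0]) }
    }
}

// "?" before a line break ends a sentence; "-" before a line break is a hyphenated word.
let wrapped = "breaks?\nHow pos-\nsibly"
print(allMatches(of: originalPattern, in: wrapped))  // joins "breaks?" and "How" into one match
print(allMatches(of: stricterPattern, in: wrapped))  // keeps "breaks" and "How" separate
```

With the second pattern the hyphenated "pos-" plus "sibly" is still returned as one match, while the question mark no longer glues two words together.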

Brilliantly written answer, but that just matches all words. What I need to do is find the range of, say, 'I do "find these' by matching from 'I do find these' (ignoring the "). Does that make sense? — Martin, Jul 08 '16 at 08:34
I see your update, but it's not clear why you're skipping the first two words or the last three. — Ro Yo Mi, Jul 08 '16 at 11:51
Just because that's what I have to do: find whether a sequence of words is somewhere in a block of text and, if so, where. — Martin, Jul 08 '16 at 13:38

score -1 · Answer 2 · answered Jul 08 '16 at 09:42

I figured it out:

let pattern = String(format: "(\\b%@\\b)",words.joinWithSeparator("[^a-zA-Z\\d\\s:]?[ ]"))

The '\b' gives word boundaries; in between, the words are matched with an optional punctuation character and then a space separating them. I will probably have to add a few bits for double punctuation, but it works for now.
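For the example in the question, the separator above looks too narrow: it allows one optional punctuation character followed by a single space, so it would not span ` "` or `", `. One possible variation, sketched here in current Swift (`joined(separator:)` replaces `joinWithSeparator`), is to allow any run of non-alphanumeric characters between words and to match case-insensitively. The function name `rangeOfWords(_:in:)` and the broader separator are assumptions for illustration, not the pattern from this answer.

```swift
import Foundation

/// Hypothetical helper: finds the range of `words` in `text`, allowing any run of
/// non-alphanumeric characters (punctuation, spaces, line breaks) between the words.
func rangeOfWords(_ words: [String], in text: String) -> Range<String.Index>? {
    guard !words.isEmpty else { return nil }
    // Escape each word so any regex metacharacters in it are matched literally.
    let escaped = words.map { NSRegularExpression.escapedPattern(for: $0) }
    // Assumption: a separator is one or more characters that are not ASCII letters or digits.
    let pattern = "\\b" + escaped.joined(separator: "[^a-zA-Z0-9]+") + "\\b"
    guard let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) else {
        return nil
    }
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    guard let match = regex.firstMatch(in: text, options: [], range: whole) else { return nil }
    // Convert the NSRange back to a range of String.Index for slicing.
    return Range(match.range, in: text)
}

let passage = "The developer shouted \"help\", these regular expressions are horrible! So, please help me :("
let words = ["shouted", "help", "these", "regular", "expressions", "are", "horrible", "so"]
if let range = rangeOfWords(words, in: passage) {
    print(passage[range])  // prints: shouted "help", these regular expressions are horrible! So
}
```

The broader separator also accepts line breaks between words. If only specific punctuation should be tolerated, the character class can be narrowed again.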
