
I'm trying to crawl a list of web pages and check whether any of them contain common misspellings.

Here's where I'm stuck.

I'm using this regex: `(\W|^)(therefor|wich|sence)(\W|$)`. It finds the misspelled words in this test string: `therefor, therefore, which, wich whichita, presence , sence, sence` and ignores cases where a misspelling is part of another word.


The problem is that the matched words end up in the second capturing group, and the parser I'm using only shows results from the first group.

So I can see that a certain page has one of the misspellings, but I don't know which one, since it's in group 2.
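To make the problem concrete, here is a minimal sketch that reproduces it. It uses Python's `re` module purely as a stand-in (an assumption; my scraper's engine may behave differently), and the variable names are only for illustration.

```python
import re

# Test string and pattern from the question.
text = "therefor, therefore, which, wich whichita, presence , sence, sence"
pattern = re.compile(r"(\W|^)(therefor|wich|sence)(\W|$)")

# Group 1 only holds the leading delimiter (or an empty string at the start);
# the misspelled word itself is captured by group 2.
group1 = [m.group(1) for m in pattern.finditer(text)]
group2 = [m.group(2) for m in pattern.finditer(text)]

print(group1)  # ['', ' ', ' ', ' ']
print(group2)  # ['therefor', 'wich', 'sence', 'sence']
```

A tool that reports only group 1 therefore shows delimiters, not the words.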

Is there a way to change the order of groups in RegEx?

PS: using something else to find misspelled words is not an option, I need to be able to perform this task with a scraper I use.

Thanks in advance.

  • Does your engine support the `\b` word boundary? Use `\b(therefor|wich|sence)\b` – Wiktor Stribiżew Apr 04 '18 at 13:32
  • Do you have to use regular expressions when there are spell checking libraries available? – Sean Bright Apr 04 '18 at 13:32
  • @WiktorStribiżew Yes, this seems to be it; I guess I was overthinking it. I tried `(\b)` in brackets, creating an unnecessary group, and I don't even know why. Thanks a lot. – Igor Gorbenko Apr 04 '18 at 13:50
  • Of course `(...)` creates a group. Any pair of matching unescaped parentheses creates a group (if it is not a POSIX BRE pattern) – Wiktor Stribiżew Apr 04 '18 at 13:51
  • @WiktorStribiżew `\b` has a flaw: it doesn't work if the line starts with a misspelled word. But it got me thinking, and I turned the first and last groups into non-capturing groups with `?:`, giving `(?:\W|^)(therefor|wich|sence)(?:\W|$)`, in case you need a similar solution. – Igor Gorbenko Apr 04 '18 at 13:59
  • You are wrong. `\bw` will match `w` at the start of the line or after a non-word char. – Wiktor Stribiżew Apr 04 '18 at 14:08
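
The two patterns suggested in the comments can be compared with a short sketch. It assumes Python's `re` module as the engine (the scraper's engine is not named in the thread and may differ), and the helper names `WORDS`, `boundary`, and `non_capturing` are hypothetical.

```python
import re

WORDS = r"(therefor|wich|sence)"

# Variant 1: word boundaries, so the word is the only (first) capturing group.
boundary = re.compile(r"\b" + WORDS + r"\b")

# Variant 2: the original delimiters turned into non-capturing groups.
non_capturing = re.compile(r"(?:\W|^)" + WORDS + r"(?:\W|$)")

text = "therefor, therefore, which, wich whichita, presence , sence, sence"

# With a single capturing group, findall returns the captured words directly.
boundary_hits = boundary.findall(text)
non_capturing_hits = non_capturing.findall(text)
print(boundary_hits)       # ['therefor', 'wich', 'sence', 'sence']
print(non_capturing_hits)  # ['therefor', 'wich', 'sence', 'sence']

# \b also matches when the misspelling is the first thing on the line.
start_hits = boundary.findall("wich is at the start")
print(start_hits)  # ['wich']

# The delimiter-based variant consumes the separator, so two misspellings
# separated by a single character yield only one hit; \b finds both.
adjacent_boundary = boundary.findall("sence sence")
adjacent_non_capturing = non_capturing.findall("sence sence")
print(adjacent_boundary)       # ['sence', 'sence']
print(adjacent_non_capturing)  # ['sence']
```

Under these assumptions both variants put the word in group 1, but only the `\b` version avoids consuming the surrounding characters.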
