
It's a recurring topic, but I haven't been able to find a good solution. I have words I need to match against the content of my page with a regex in JavaScript, and they absolutely have to match as whole words, not as parts of words. However, some of them start or end with a letter from this set: [zżźćńółęąśŻŹĆĄŚĘŁÓŃA].

Word boundaries (`\b`) obviously do not work when one of these letters is at the beginning or the end of the word. Replacing the letters with their Unicode counterparts doesn't seem to work either.

Right now I'm using a hack: I assigned numbers from 1 to 9 to lowercase letters from the list, and I'm checking if any letter in a word matches any key from the character dictionary. If it does, it gets replaced with a number, then I replace it the same way in the content I need to match against.

It kinda works, but it's a half-measure, and it means the match is no longer case-sensitive, and I would really like to keep case sensitivity.

Surely there has to be a clean solution?

EDIT as asked in a comment...

/\bbudyń\b/g

budyń budyńasda

This matches the `budyń` inside `budyńasda`; it should match the first word and leave the other one intact.

/\bósemka\b/g

ósemka asdaósemka

Likewise.
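
A minimal sketch of why both patterns behave this way, assuming the text uses precomposed characters (NFC): in JavaScript `\w` covers only `[A-Za-z0-9_]`, so `ń` and `ó` count as non-word characters and `\b` lands on the wrong side of them. The helper name `hits` is made up for this illustration.

```javascript
// \b is defined in terms of \w, and \w is ASCII-only in JavaScript,
// so a boundary appears between "ń" and "a" but not between "ń" and " ".
const hits = (re, text) => [...text.matchAll(re)].map((m) => m.index);

const budynHits = hits(/\bbudyń\b/g, "budyń budyńasda");
const osemkaHits = hits(/\bósemka\b/g, "ósemka asdaósemka");

console.log(budynHits);  // start index of every match
console.log(osemkaHits);
```

In both cases the standalone word is skipped and the occurrence glued to ASCII letters is the one that matches.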

zephi
  • Please show the entire regex you are currently trying - preferably, the one with your "does not work" set at the end. Also, add a couple of words on which it fails. – Jongware Feb 23 '16 at 23:17
  • [An answer that might be helpful...](http://stackoverflow.com/a/2449892/5527985) [`(^|[^\wÀ-ÖØ-öø-ſ])(budyńń|ósemka|asdaósemka)(?![\wÀ-ÖØ-öø-ſ])`](https://regex101.com/r/jZ2sJ7/1) – bobble bubble Feb 24 '16 at 01:54
  • Unfortunately it matches one whitespace before the word. – zephi Feb 24 '16 at 11:19

1 Answer


Try this:

/(?:\s|^)(ósemka)(?=\s|$)/g

The one above assumes that the word is followed only by a whitespace character (or the end of the string). If other characters may follow the word, such as a period or a question mark, then this should work:

/(?:\s|^)(ósemka)(?=[\s\.\?!;]|$)/g
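
As a rough sketch of how the captured word can be used without the leading whitespace: the variant below also captures the prefix (a change from the pattern above, made for illustration) so it can be put back during a replace. The sample text and the `highlight` class name are assumptions.

```javascript
// Group 1 is the leading whitespace (or empty at the start of the string),
// group 2 is the word itself.
const text = "ósemka asdaósemka, ósemka!";
const re = /(\s|^)(ósemka)(?=[\s.?!;]|$)/g;

// Read only the word from each match.
const words = [...text.matchAll(re)].map((m) => m[2]);

// Re-emit the whitespace unchanged and wrap only the word.
const highlighted = text.replace(re, '$1<span class="highlight">$2</span>');

console.log(words);
console.log(highlighted);
```

Because the trailing check is a lookahead, it consumes nothing, so two matches separated by a single space are both found.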
Jon
  • The problem with this is it matches one whitespace before the word. Any idea how to get rid of it? – zephi Feb 24 '16 at 10:59
  • @zephi: You have to extract the first capturing group. That is the string you're interested in without the whitespace. – Jon Feb 24 '16 at 11:21
  • I'm sorry, I don't follow. Please look at this: https://regex101.com/r/jZ2sJ7/2 , I don't understand what exactly you mean by 'extract' the first capturing group. Isn't that the whole point of this question? – zephi Feb 24 '16 at 11:39
  • I see. Unfortunately, if white space disqualifies your matches, I'm not sure there'd be a way, given that word boundaries won't work. Maybe if I had a little more of an idea of what you're doing with the matches (what they tell you, etc), I could help. Are you replacing with something else? If that's the case, then you're still in business. – Jon Feb 24 '16 at 12:10
  • Sure. I have a list of words, and a phrase taken directly from the content of a page. I need to find every word that is both in the phrase and in the list, and wrap it in a container with a highlighting CSS class. So it's pretty much just a simple word highlight, but because of that the match needs to be exact and case-sensitive. – zephi Feb 24 '16 at 12:19
  • Maybe from that info, someone else will be able to add something, but I think I've offered all I can. – Jon Feb 24 '16 at 12:29
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/104440/discussion-between-jon-and-zephi). – Jon Feb 24 '16 at 15:34
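
For the highlighting use case described in the comments, one possible approach (not something proposed in the thread) is to replace `\b` with Unicode-aware lookarounds. This sketch assumes an engine with ES2018 lookbehind and `\p{…}` property escapes, plain-text input in NFC form, and the hypothetical names `highlightWords`, `escapeRegExp` and `highlight`.

```javascript
// Escape regex metacharacters so list entries are matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

function highlightWords(text, words, className = "highlight") {
  if (words.length === 0) return text;
  const alternation = words.map(escapeRegExp).join("|");
  // A match is rejected if a letter, digit, or underscore touches the word
  // on either side. The lookarounds consume nothing, so no whitespace is
  // included in the match. No "i" flag, so matching stays case-sensitive.
  const re = new RegExp(
    `(?<![\\p{L}\\p{N}_])(?:${alternation})(?![\\p{L}\\p{N}_])`,
    "gu"
  );
  return text.replace(re, (word) => `<span class="${className}">${word}</span>`);
}

const result = highlightWords(
  "budyń budyńasda ósemka asdaósemka Budyń",
  ["budyń", "ósemka"]
);
console.log(result);
```

Only the standalone `budyń` and `ósemka` are wrapped here; `budyńasda`, `asdaósemka` and the capitalised `Budyń` are left untouched.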