I'm trying to match nodes in a Neo4j database. The nodes have a property called "name", and I'm using a regular expression in Cypher to match against it. I only want to match whole words, so "javascript" should not match if I supply the string "java". If the search string consists of several words, e.g. "java script", I will run two separate queries, one for "java" and one for "script".

This is what I have so far:

match (n) where n.name =~ '(?i).*\\bMYSTRING\\b.*' return n

This works, but not with special characters such as "+" or "#", so I can't search for "C++" or "C#". The regular expression above just uses \b for the word boundary; the backslash is doubled so that it is escaped correctly inside the Cypher string.

I tried some variations of the answers in this post: "regex to match word boundary beginning with special characters", but they didn't really work. Maybe I did something wrong.

How can I make this work with special characters in Cypher and Neo4j?

Øyvind

2 Answers

Try escaping the special characters and looking for non-word characters rather than word boundaries. For example:

match (n) where n.name =~ '(?i).*(?:\\W|^)C\\+\\+(?:\\W|$).*' return n

This still has some false positives, though. For example, the above will match "c+++".
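To make the difference concrete, here is a minimal sketch that runs both patterns over a few sample names. It uses Python's re module purely as a stand-in for the Java regex engine that Cypher relies on (an assumption made so the check is easy to run; \b, \W and (?i) behave the same way for the ASCII inputs shown). The helper name matches is hypothetical.

```python
import re

# Stand-in for Neo4j: Python's re is NOT the engine Cypher uses (that is Java's),
# but \b, \W and (?i) behave the same for these ASCII inputs.
WORD_BOUNDARY = re.compile(r"(?i).*\bC\+\+\b.*")
NON_WORD_FLANKS = re.compile(r"(?i).*(?:\W|^)C\+\+(?:\W|$).*")


def matches(pattern, name):
    """Mimic Cypher's =~, which has to match the whole property value."""
    return pattern.fullmatch(name) is not None


for name in ["C++", "I know C++ well", "c++c", "c+++"]:
    # Columns: name, result with \b, result with (?:\W|^) ... (?:\W|$)
    print(f"{name!r:20} {matches(WORD_BOUNDARY, name)!s:6} "
          f"{matches(NON_WORD_FLANKS, name)}")
```

With \b, only "c++c" matches, because a boundary only exists where a word character meets a non-word character. With the non-word flanks, "C++" and "I know C++ well" match and "c++c" no longer does, but "c+++" slips through as the false positive mentioned above.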

To express "a non-word character, except that + should be treated as a word character", the following could work:

match (n) where n.name =~ '(?i).*(?:[\\W-[+]]|^)C\\+\\+(?:[\\W-[+]]|$).*' return n

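However, this character class subtraction syntax is not supported by all regexp flavors. Cypher uses Java regular expressions, where the same set would be written as an intersection instead: [\\W&&[^+]].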

Taemyr
  • This would normally work, but the \b word boundary only operates with alphanumerical characters, so it does not match properties like "c++" (which either start or end with a special character). It would match properties like "c++c", since that ends with a "c". – Øyvind Sep 18 '14 at 11:16
  • This was working, but it also matched if there were characters before or after the string, as you mention in the updated answer. Is there a way to make it only match on whole words? Wouldn't the updated answer be the same as '(?i).*C\\+\\+.*' ? – Øyvind Sep 18 '14 at 11:41
  • @Øyvind The answer is not the same as '(?i).*C\\+\\+.*', as that would match c++c. – Taemyr Sep 18 '14 at 13:13
  • @Øyvind To make it match only on whole words, you need to be explicit about what you mean by whole words. The regex engine considers a word boundary to be the boundary between word characters and non-word characters, but that does not work for you since you want to treat "c++c" as a single word. You could replace "\\W" with "\\s" to look for matches flanked by whitespace, or with " " to look for matches flanked by the space character, or with a custom character group that fits your criteria. – Taemyr Sep 18 '14 at 13:18
  • Ok, my boundary is whitespace, that is, words that are separated. It worked as expected (with limited testing) when I replaced \\W with \\s. Do I need to check for the space character as well as whitespace? Are they not the same? – Øyvind Sep 18 '14 at 13:26
  • @Øyvind \s contains the space character, but also other characters. So if you are checking for \s you do not need to look for space separately; however, you can get hits on whitespace that is not the space character. – Taemyr Sep 18 '14 at 13:31
  • @Øyvind You might also want to consider punctuation. A sentence ending with "c++." will not match if you use \s. – Taemyr Sep 18 '14 at 13:35
  • Ok, thanks for the tips regarding \s! Is there some way I can use the answer in the post I have linked in my question? It says that \b is a shorthand for (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) and if I want to add more characters as word characters then I can change it to something like (?:(?<![\w~])(?=[\w~])|(?<=[\w~])(?![\w~])) where ~ now is a word character. However, if I write something like '(?i).*(?:(?<![w+])(?=[w+])|(?<=[w+])(?![w+]))c\\+\\+(?:(?<![w+])(?=[w+])|(?<=[w+])(?![w+])).*' it does not work. It would solve my problem completely if I could just add a few characters as word characters. – Øyvind Sep 19 '14 at 06:20
  • Seems like I was missing a few backslashes for escaping other backslashes. – Øyvind Sep 19 '14 at 06:24
  • @Øyvind Since you are returning the whole string, and you know whether you are at the beginning or the end of your word, you don't need as many lookarounds as that example uses. – Taemyr Sep 19 '14 at 06:27
  • You are right, this is what I ended up with: '(?i).*(?<![\\w+#])$match(?![\\w+#]).*', feel free to update your answer and I will mark it as correct :) This adds the characters "+" and "#" as word characters. – Øyvind Sep 19 '14 at 06:45
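The pattern the asker ended up with in the comments can be sketched as a small builder that escapes the search term and wraps it in the two lookarounds. This is only an illustration under stated assumptions: Python's re stands in for the Java regex engine behind Cypher, and whole_word_pattern and extra_word_chars are hypothetical names, not part of Neo4j.

```python
import re


def whole_word_pattern(term, extra_word_chars="+#"):
    """Build a pattern shaped like '(?i).*(?<![\\w+#])TERM(?![\\w+#]).*'.

    Hypothetical helper: both the function name and the extra_word_chars
    parameter are made up for this sketch.
    """
    # Character class of everything that counts as "part of a word".
    word_class = r"[\w" + re.escape(extra_word_chars) + "]"
    # Escape the term so that "+" and "#" are taken literally.
    return f"(?i).*(?<!{word_class}){re.escape(term)}(?!{word_class}).*"


def name_matches(term, name):
    # fullmatch mimics Cypher's =~, which must match the whole value.
    return re.fullmatch(whole_word_pattern(term), name) is not None


print(name_matches("c++", "I like C++ a lot"))  # True
print(name_matches("c++", "c+++"))              # False
print(name_matches("java", "javascript"))       # False
```

Because the lookarounds only inspect the neighbouring character without consuming it, a term at the very start or end of the name, or one followed by punctuation such as "c++.", still counts as a whole word.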
Instead of asserting word boundaries, you can assert that there is whitespace (or nothing at all, i.e. the start or end of the string) before and after your match:

(?i).*(?<!\\S)MYSTRING(?!\\S).*

Here is a regex demo you can fiddle with. It only matches your string if there is whitespace, or the start or end of the input, directly before and after your word. You can define "punctuation" as an additional boundary if you need to, like this:

(?i).*(?<![^\\s.,$])MYSTRING(?![^\\s.,$]).*
               ^^^  add boundaries  ^^^

Then it will also match "rawrssss MYSTRING. dd".

See a regex demo!
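As a rough check of the two variants above, the sketch below compares the strict whitespace version with the one that also accepts ".", "," and "$" as boundaries. Python's re is used here only as a convenient stand-in for the Java engine that Cypher uses (an assumption; these lookarounds behave the same for the inputs shown), and the names STRICT, WITH_PUNCT and matches are hypothetical.

```python
import re

# Whitespace (or start/end of input) on both sides.
STRICT = re.compile(r"(?i).*(?<!\S)MYSTRING(?!\S).*")
# Same idea, but ".", "," and "$" also count as boundaries.
WITH_PUNCT = re.compile(r"(?i).*(?<![^\s.,$])MYSTRING(?![^\s.,$]).*")


def matches(pattern, name):
    """Mimic Cypher's =~, which has to match the whole property value."""
    return pattern.fullmatch(name) is not None


for name in ["MYSTRING", "a mystring b", "rawrssss MYSTRING. dd", "xMYSTRING"]:
    # Columns: name, strict result, result with punctuation allowed
    print(f"{name!r:26} {matches(STRICT, name)!s:6} {matches(WITH_PUNCT, name)}")
```

Only the third sample separates the two patterns: the trailing "." blocks the strict version but is accepted once punctuation is listed in the negated class.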

Unihedron