
I am very close to solving this thanks to the post "Regex find word in the string".

But I am still not 100% there.

If I use this regex with Apache's BrowserMatchNoCase:

^(.*?)(\b360Spider\b)(.*)$

I get the following results:

  • 360Spider = match
  • 360spider = match
  • 360SpIdEr = match
  • 360spiders = no match
  • Not360Spider = no match
  • Not-360Spider = match
  • Not-360spider = match

I need it to match the word 360Spider regardless of what comes before or after it, so Not360Spider should also be a match.

Thanks in advance. My regex has improved somewhat over the years, but I am still nowhere close to getting patterns right without introducing false positives.

At the same time, I do not want to introduce other false positives, which is why I am delving into this in the first place. For example, with user-agent names like "Exabot" and "Alexabot", I don't want the "exabot" part of Alexabot to be detected.

So let's say in another example:

^(.*?)(\bExabot\b)(.*)$

I get the following results:

  • Alexabot = no match
  • Exabot = match
  • exAbot = match

If I remove word boundaries "\b" as follows:

^(.*?)(Exabot)(.*)$

I get the following results:

  • Alexabot = match
  • Exabot = match
  • exAbot = match
  • anythingExabot = match

So I guess I have to stick with the word boundaries "\b". Now the trick is to get printf to write the "\b" into my string and not treat it as a backspace character.

MitchellK
  • Remove word boundaries `\b`. It will also match `360spiders`, by the way. – Wiktor Stribiżew Jun 26 '17 at 09:52
  • *Is this even possible?* - No, it is not possible to understand what you are asking. It is surely not possible to match `exabot` with `^(.*?)(\b360Spider\b)(.*)$`. – Wiktor Stribiżew Jun 26 '17 at 10:05
  • Thanks guys I updated my question with a few more examples, seems I have to stick with `\b` word boundaries – MitchellK Jun 26 '17 at 10:28
  • To define a literal ``\`` in a regular string literal it is usually required to put double backslash. It is not necessary if you define a pattern in some text file that is read in and then parsed by an engine. – Wiktor Stribiżew Jun 26 '17 at 10:34
  • Thanks I figured out my printf syntax in my bash script `printf "BrowserMatchNoCase \"^(.*?)(\\\b${line}\\\b)(.*)$\" good_bot\n"` – MitchellK Jun 26 '17 at 11:11

1 Answer


Note that once you add word boundaries around 360Spider, you can't match it when it is directly adjacent to letters, digits, or _ symbols, since all of those are considered word characters.

If you need to match the word anywhere inside a string, you need to remove word boundaries, \b. However, judging by your examples, you still need the word boundaries as otherwise, you will match exabot in Alexabot.
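To see the trade-off on the sample strings from the question, here is a minimal sketch. It assumes GNU grep, whose `-E` mode accepts `\b` as a word boundary, rather than Apache's own regex engine, and `check_ua` is a hypothetical helper name used only for illustration.

```shell
#!/bin/bash
# Sketch: report whether a user-agent string contains a bot name as a whole word.
# Assumption: GNU grep, where -E understands \b; -i mimics BrowserMatchNoCase.
check_ua() {
  local name=$1 ua=$2
  # Inside double quotes, "\\b" reaches grep as \b.
  if printf '%s\n' "$ua" | grep -qiE "\\b${name}\\b"; then
    echo "match"
  else
    echo "no match"
  fi
}

# Sample strings taken from the question.
for ua in 360Spider 360spiders Not360Spider Not-360Spider; do
  printf '%-14s = %s\n' "$ua" "$(check_ua 360Spider "$ua")"
done
```

`Not-360Spider` matches because `-` is not a word character, so a boundary exists between `-` and `3`. `Not360Spider` does not match because `t` and `3` are both word characters.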

Here is a way to define your pattern in Bash:

#!/bin/bash
line='var_here'
printf "BrowserMatchNoCase \"^(.*?)(\\\b${line}\\\b)(.*)\$\" good_bot\n"

See an online demo. Note it is a good idea to escape the $ inside an interpolated string literal.
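The triple backslash works because, inside double quotes, Bash collapses `\\` to a single backslash and leaves `\b` alone, so printf receives `\\b` and prints `\b`. A possible alternative, sketched below, is to keep the format string in single quotes and pass the name through `%s`, so that neither `$` nor the backslashes need shell-level escaping. The `emit_rule` helper and the inline name list are assumptions for illustration, and the sketch assumes bot names contain no regex metacharacters.

```shell
#!/bin/bash
# Sketch: emit one BrowserMatchNoCase directive per bot name.
# emit_rule is a hypothetical helper; names are assumed to be plain words.
emit_rule() {
  # Single quotes stop the shell from touching \\ and $;
  # printf itself turns \\ into a single backslash, giving \b in the output.
  printf 'BrowserMatchNoCase "^(.*?)(\\b%s\\b)(.*)$" good_bot\n' "$1"
}

# Read names line by line (here from an inline list) and skip empty lines.
while IFS= read -r line; do
  [ -n "$line" ] && emit_rule "$line"
done <<'EOF'
360Spider
Exabot
EOF
```

Passing the name as an argument instead of interpolating it into the format also means a stray `%` in a name is not interpreted by printf.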

Wiktor Stribiżew
  • Thanks Wiktor, I can see why escaping that last $ in the printf string is important. Indeed I am stuck with word boundaries, without them there's just a whole world of false positives. This new regex is working 100% now. Strange though that `-` is ignored by the word boundary. – MitchellK Jun 26 '17 at 12:40
  • @MitchellK: You are welcome. If you need to tweak the word boundaries, feel free to drop a line. `-` is not ignored by a word boundary, the word boundary exists between `a` and `-`, between `-` and `a`, but not between `;` and `-`. – Wiktor Stribiżew Jun 26 '17 at 12:42
  • Thanks again Wiktor, you're a star. I will update my original question to become more of a question & answer. – MitchellK Jun 27 '17 at 09:52