18

I have the current regular expression:

/(?<=[\s>]|^)#(\w*[A-Za-z_]+\w*)/g

Which I'm testing against the string:

Here's a #hashtag and here is #not_a_tag; which should be different. Also testing: Mid#hash. #123 #!@£ and <p>#hash</p>

For my purposes, only two hashtags should be detected in this string. How can I alter the expression so that it doesn't match hashtags that end with a ; (in my example, #not_a_tag;)?
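
For reference, a minimal JavaScript sketch of what the expression above currently returns for the test string. The variable names are only for illustration, and lookbehind support in the engine is assumed.

```javascript
// Test string from the question.
const sample =
  "Here's a #hashtag and here is #not_a_tag; which should be different. " +
  "Also testing: Mid#hash. #123 #!@£ and <p>#hash</p>";

// The expression from the question, unchanged.
const current = /(?<=[\s>]|^)#(\w*[A-Za-z_]+\w*)/g;

// With the g flag, match() returns every full match.
const found = sample.match(current);
console.log(found); // [ '#hashtag', '#not_a_tag', '#hash' ]
```

The unwanted third result is #not_a_tag, matched without its trailing semicolon.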

Cheers.

Wex

7 Answers

38

How about the following:

\B(\#[a-zA-Z]+\b)(?!;)

  • \B -> Not a word boundary, so the # cannot directly follow a word character (this rules out Mid#hash)
  • (\#[a-zA-Z]+\b) -> Capturing group beginning with #, followed by one or more letters a-z or A-Z, with a word boundary at the end
  • (?!;) -> Not followed by ;
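
A quick way to check this pattern against the string from the question, as a minimal JavaScript sketch (the g flag and the variable names are assumptions added for the test, and the # is written unescaped):

```javascript
// Test string from the question.
const sample =
  "Here's a #hashtag and here is #not_a_tag; which should be different. " +
  "Also testing: Mid#hash. #123 #!@£ and <p>#hash</p>";

// \B: the # must not sit on a word boundary (rejects Mid#hash).
// \b(?!;): the tag must end at a word boundary that is not followed by ";".
const pattern = /\B(#[a-zA-Z]+\b)(?!;)/g;

const tags = sample.match(pattern);
console.log(tags); // [ '#hashtag', '#hash' ]
```

Because the character class only contains letters, a tag with digits or underscores would not match in full.
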
tk78
11

This is the best practice.

(#+[a-zA-Z0-9(_)]{1,})
nhCoder
  • 2
    Best answer on here, thank you. Only modification that may be needed is to allow åççéñts if your software will be international. Maybe something like `(#+[a-zA-Z0-9A-Za-zÀ-ÖØ-öø-ʸ(_)]{1,})` – Albert Renshaw Feb 20 '21 at 00:44
  • Perfect, but ####tag is also valid. UPD: `^#[a-zA-Z-а-яА-ЯÀ-ÖØ-öø-ʸ0-9(_)]{1,}$` – vusaldev Apr 07 '23 at 10:03
  • Why does this answer include brackets `()` as a valid hashtag character? Also why does it allow multiple hashtags like ##hashtag? Also why is `{1,}` used, if a simple `+` would be sufficient? – NicoHood Jul 01 '23 at 10:29
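
To make the points raised in the last comment concrete, here is a small JavaScript comparison. The tightened pattern is only a hypothetical variant, not the answerer's: it drops the parentheses from the character class, requires a single #, and uses + in place of {1,}. The sample text is made up for the demonstration.

```javascript
// Pattern as posted in the answer (g flag added to collect all matches).
const original = /(#+[a-zA-Z0-9(_)]{1,})/g;

// Hypothetical tightened variant: one "#", no "(" or ")" in the class, "+" quantifier.
const tightened = /#[a-zA-Z0-9_]+/g;

// Made-up sample text.
const text = "##tag(1) and #ok_2";

const originalMatches = text.match(original);
const tightenedMatches = text.match(tightened);

console.log(originalMatches);  // [ '##tag(1)', '#ok_2' ]
console.log(tightenedMatches); // [ '#tag', '#ok_2' ]
```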
8
/(#(?:[^\x00-\x7F]|\w)+)/g

Starts with #, followed by at least one (+) non-ASCII character ([^\x00-\x7F], a negated class that excludes the ASCII range) or word character (\w).

This covers hashtags containing non-ASCII characters, such as "#їжак".
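
A minimal JavaScript sketch of how this behaves. The sample text is made up, and the second pattern, which uses Unicode property escapes with the u flag, is an alternative added for comparison rather than part of the answer.

```javascript
// Pattern from the answer: any non-ASCII character or \w after "#".
const asciiNegated = /#(?:[^\x00-\x7F]|\w)+/g;

// Assumed alternative: only Unicode letters, digits and "_" (needs the u flag).
const unicodeAware = /#[\p{L}\p{N}_]+/gu;

// Made-up sample text.
const text = "I saw a #їжак, a #cat and a #£5 note";

const negatedMatches = text.match(asciiNegated);
const unicodeMatches = text.match(unicodeAware);

console.log(negatedMatches); // [ '#їжак', '#cat', '#£5' ]
console.log(unicodeMatches); // [ '#їжак', '#cat' ]
```

The negated ASCII range also accepts non-ASCII symbols such as £, which may or may not be wanted.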

ne4istb
4

You can use a negative lookahead regex:

/(?<=[\s>]|^)#(\w*[A-Za-z_]+\w*)\b(?!;)/
  • \b - word boundary ensures that we are at end of word
  • (?!;) - asserts that we don't have semi-colon at next position
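
As a quick check, a minimal JavaScript sketch that runs this pattern over the string from the question (the g flag is added here so that all matches are collected; variable names are illustrative):

```javascript
// Test string from the question.
const sample =
  "Here's a #hashtag and here is #not_a_tag; which should be different. " +
  "Also testing: Mid#hash. #123 #!@£ and <p>#hash</p>";

// Question's pattern plus \b(?!;): the tag must end at a word boundary
// that is not immediately followed by a semicolon.
const pattern = /(?<=[\s>]|^)#(\w*[A-Za-z_]+\w*)\b(?!;)/g;

const tags = sample.match(pattern);
console.log(tags); // [ '#hashtag', '#hash' ]
```

The \b matters: without it, the engine could backtrack and match a shorter prefix such as #not_a_ta, which is not followed by a semicolon.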

anubhava
1

Similar to anubhava's answer, but with the two instances of \w* swapped for \d*, since the only difference between \w and [A-Za-z_] is the digits 0-9.

On the demo, this reduces the number of steps from 588 to 90:

(?<=[\s>])#(\d*[A-Za-z_]+\d*)\b(?!;)
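
One thing worth checking before swapping: the two patterns are not strictly equivalent. The JavaScript sketch below uses a made-up input to show two differences. This variant has no ^ alternative in the lookbehind, so a tag at the very start of the string is skipped, and a digit between letters (as in #a1b) no longer matches.

```javascript
// anubhava's pattern and the \d* variant, both with the g flag added.
const withWord = /(?<=[\s>]|^)#(\w*[A-Za-z_]+\w*)\b(?!;)/g;
const withDigits = /(?<=[\s>])#(\d*[A-Za-z_]+\d*)\b(?!;)/g;

// Made-up input: a tag at the start, a tag ending in a digit,
// and a tag with a digit between letters.
const text = "#first #tag2 #a1b";

const wordMatches = text.match(withWord);
const digitMatches = text.match(withDigits);

console.log(wordMatches);  // [ '#first', '#tag2', '#a1b' ]
console.log(digitMatches); // [ '#tag2' ]
```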

garyh
1
(?<=(\s|^))#[^\s\!\@\#\$\%\^\&\*\(\)]+(?=(\s|$))

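A regex that matches any hashtag.

In this approach, any character is accepted in a hashtag except whitespace and the signs !@#$%^&*(). The hashtag must also be preceded by whitespace or the start of a line, and followed by whitespace or the end of a line.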

Usage Notes

Turn on the "g" and "m" flags when using it.

It was tested with the Java and JavaScript flavors on https://regex101.com and in VS Code.
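
A minimal JavaScript sketch with both flags set, run against the string from the original question (variable names are illustrative):

```javascript
// Test string from the question.
const sample =
  "Here's a #hashtag and here is #not_a_tag; which should be different. " +
  "Also testing: Mid#hash. #123 #!@£ and <p>#hash</p>";

// g: collect every match; m: let ^ and $ match at line breaks too.
const pattern = /(?<=(\s|^))#[^\s\!\@\#\$\%\^\&\*\(\)]+(?=(\s|$))/gm;

const tags = sample.match(pattern);
console.log(tags); // [ '#hashtag', '#not_a_tag;', '#123' ]
```

Note that the semicolon is not in the excluded set, so #not_a_tag; is matched together with its semicolon, and #hash inside the p element is skipped because it is preceded by >.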

It is available on this repo.

SVG-Heart
  • Don't think your answer is answering OP questions: https://regex101.com/r/FFvPfn/1 OP doesn't want to match the semicolon. For the future it's better to share direct regex101 demo/snippet instead of just link to the landing page. – Anton Krug Apr 10 '21 at 15:54
0

You could try this pattern: /#\S+/

It matches # followed by every character up to the next whitespace.
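
A minimal JavaScript sketch of what that yields on the string from the question (the g flag and variable names are added for the demonstration):

```javascript
// Test string from the question.
const sample =
  "Here's a #hashtag and here is #not_a_tag; which should be different. " +
  "Also testing: Mid#hash. #123 #!@£ and <p>#hash</p>";

// "#" followed by one or more non-whitespace characters.
const pattern = /#\S+/g;

const tags = sample.match(pattern);
console.log(tags);
// [ '#hashtag', '#not_a_tag;', '#hash.', '#123', '#!@£', '#hash</p>' ]
```

Trailing punctuation and markup are included in the matches, so some post-processing would be needed for the question's use case.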

Ajay Lingayat