
I am writing a simple tool to allow people to label jobs on a server. The allowed characters are below. In addition to this, I want to allow emojis (any emojis).

I've referenced this stackoverflow question for the latter. But when I try to combine them, the code does not work.

Here's my function:

func SafeCharsRegex() *regexp.Regexp {
    regexString := "A-Za-z0-9._~!:@,;+-"
    emojiString := `[©®‼⁉™ℹ↔-↙↩↪⌨⏏⏭-⏯⏱⏲⏸-⏺Ⓜ▪▫▶◀◻◼☀-☄☎☑☘☠☢☣☦☪☮☯☸-☺♀♂♟♠♣♥♦♨♻♾⚒⚔-⚗⚙⚛⚜⚠⚧⚰⚱⛈⛏⛑⛓⛩⛰⛱⛴⛷⛸✂✈✉✏✒✔✖✝✡✳✴❄❇❣➡⤴⤵⬅-⬇〰〽㊗㊙----------]️?|[☝✌✍][️-]?|[⛹](?:‍[♀♂]️?|[️-](?:‍[♀♂]️?)?)?|[✊✋-------][-]?|❤(?:‍[]|️(?:‍[])?)?|[--]|[---]|[---]|[]|[-]|[-]|[---]|[]|[---]|[]|[-]|[--]|[--]|[-]||[---]||[]|[----]|[--]|[]|[]|[]||[]|[]|[-----](?:‍[♀♂]️?|[-](?:‍[♀♂]️?)?)?|(?:‍(?:⚧️?|)|️(?:‍(?:⚧️?|))?)?|(?:‍☠️?|(?:||))?|(?:‍⬛)?|(?:‍)?|(?:‍❄️?)?|(?:‍️?|️(?:‍️?)?)?|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?|(?:‍)?|(?:‍[])?|[]‍(?:(?:‍)?|(?:‍[])?)|[-])|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[-]|‍[-]|[-]))?|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[-]|‍[-]|[-]))?|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[-]|‍[]|[-]))?|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[-]|‍[-]|[-]))?|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[-]|‍[-]|[-]))?)?|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:[]|‍[])|(?:‍)?|(?:‍[])?|‍(?:(?:‍)?|(?:‍[])?)|[-])|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[][-]|‍[][-]|[-]))?|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[][-]|‍[][-]|[-]))?|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[][-]|‍[][]|[-]))?|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[][-]|‍[][-]|[-]))?|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[][-]|‍[][-]|[-]))?)?|[](?:‍[♀♂]️?)?|(?:‍)?|(?:‍)?|(?:‍️?)?|(?:‍(?:[⚕⚖✈]️?|‍|[-])|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[-]|‍[-]|[-]))?|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[-]|‍[-]|[-]))?|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[]|‍[-]|[-]))?|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[-]|‍[-]|[-]))?|(?:‍(?:[⚕⚖✈]️?|❤️?‍(?:‍)?[-]|‍[-]|[-]))?)?|[⌚⌛⏩-⏬⏰⏳◽◾☔☕♈-♓♿⚓⚡⚪⚫⚽⚾⛄⛅⛎⛔⛪⛲⛳⛵⛺⛽✅✨❌❎❓-❕❗➕-➗➰➿⬛⬜⭐⭕-----------------------------------------------------]|(?:(?:‍[-])?|(?:‍[-])?|(?:‍[])?|(?:‍[-])?|(?:‍[-])?)?`

    r := regexp.MustCompile(fmt.Sprintf("[^%s|%s]", regexString, emojiString))
    return r
}

and here's how I'm using it.

    s := ""
    r := SafeCharsRegex()
    safeString := r.ReplaceAllString(s, "")

EXPECTED BEHAVIOR: safeString == ""

ACTUAL BEHAVIOR: safeString == ""

I've tried removing the concatenation, using just the emoji string, and so on, and nothing works. However, when I make the emojiString JUST those four emojis, it does work. So something in the VERY complex emoji string is broken.

HOWEVER, when I go here - https://regex101.com/r/iCzyv2/1 - it works fine.

aronchick

1 Answer


A better approach may be to think in terms of Unicode. Unicode assigns every character a general category (or class), each identified by a short name, and some regex flavors define tokens that match specific categories. From what I see, you want to match letters (category L; note that this includes letters like é), numbers (N), punctuation (P), and symbols (S).

The documentation of the regexp package says that it implements the RE2 syntax. To specify a token that matches a Unicode character class, such as numbers, we can use \p{N}. So we can define a regex like:

[\p{L}\p{P}\p{S}\p{N}]+

This fully matches the string thisis_atest998
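As a rough illustration, here is a minimal sketch of how that class could be used in Go to check a whole label. The names allowedLabel and isAllowed are made up for this example, and the ^ and $ anchors are my addition so that the entire string has to match. Note that the joiners used inside composite emoji (the zero-width joiner U+200D and the variation selector U+FE0F) fall outside these four categories, so some emoji sequences would probably need extra handling.

```go
package main

import (
	"fmt"
	"regexp"
)

// allowedLabel (hypothetical name) matches strings made up entirely of
// letters, punctuation, symbols, and numbers. The anchors are an assumption:
// they force the whole string to match, not just a part of it.
var allowedLabel = regexp.MustCompile(`^[\p{L}\p{P}\p{S}\p{N}]+$`)

// isAllowed reports whether every character of label is in one of the
// four Unicode categories above.
func isAllowed(label string) bool {
	return allowedLabel.MatchString(label)
}

func main() {
	fmt.Println(isAllowed("thisis_atest998")) // letters, "_" (punctuation), digits
	fmt.Println(isAllowed("has space"))       // contains a space separator
	fmt.Println(isAllowed("\U0001F600"))      // a single-code-point emoji (category So)
}
```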

Since your filter is so permissive (what is not allowed here? whitespace?), it may be better to search for the disallowed characters instead.
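A sketch of that inverted approach, assuming the only things you want to strip are separators (category Z, which includes spaces) and control characters (category Cc, which includes tabs and newlines). That choice of categories is my guess at what "not allowed" means here, and disallowedChars and stripDisallowed are made-up names:

```go
package main

import (
	"fmt"
	"regexp"
)

// disallowedChars (hypothetical name) matches runs of separators (\p{Z})
// and control characters (\p{Cc}). Which categories to reject is an
// assumption made for this example only.
var disallowedChars = regexp.MustCompile(`[\p{Z}\p{Cc}]+`)

// stripDisallowed removes every disallowed character from s.
func stripDisallowed(s string) string {
	return disallowedChars.ReplaceAllString(s, "")
}

func main() {
	// The space and tab are removed; "#", the digit, and the emoji stay.
	fmt.Println(stripDisallowed("my job\t#1 \U0001F600"))
}
```

This keeps the pattern short, at the cost of letting through anything you did not think to reject.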

Eduardo Thales
Thanks! I dunno - my mental space is to go to only allowed characters in such a large attack surface. I'll mess around with this syntax though, it may be an elegant solution. My goal was to allow a very small set of characters that are allowed in a URL without translation (the regex string) and every emoji. – aronchick Jun 19 '22 at 18:13