Sanitizing strings with regex: normalize strips Japanese characters entirely

Question

I previously asked a question about how to use regex to sanitize a string that will be used in a URL. The code I am using in my API is this:

const newPost = new Post({
    category: req.body.category,
    user: req.body.user,
    title: req.body.title,
    content: req.body.content,
    shadowBanned: req.body.shadowBanned,
    community: req.body.community,
    categoryMeta: req.body.categoryMeta,
    image: req.body.imageUrl,
    externalLink: req.body.link,
    urlHash: req.body.title
      .replace(/(\s|-)/g, '_')
      .normalize('NFKD')
      .replace(/\W/g, '')
      .toLowerCase(),
    postUrl:
      req.body.community +
      '/c/' +
      req.body.category +
      '/p/' +
      req.body.title
        .replace(/(\s|-)/g, '_')
        .normalize('NFKD')
        .replace(/\W/g, '')
        .toLowerCase(),
    source: req.body.link
      ? new URL(req.body.link).hostname.replace('www.', '')
      : '',
});

The regex on urlHash and postUrl works well with western languages: it outputs strings formatted like esp/c/CategoryName/p/ThisIsMyTitleWithoutSymbolsAndSpaces (title case used here only for readability). With Japanese or Korean characters, however, the title ends up empty or keeps only a few characters. Here's an example:

const text = '「Apex Legends」次期大型アップデート“デファイアンス”のローンチトレイラーが公開に'
// will output apex_legends

Only the western characters survive. If the title contains no western characters at all, the replace() call wipes everything.
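To make the failure concrete, here is a minimal trace of the same chain on a title with no ASCII letters. The variable names are only for illustration. It suggests that normalize() is not what removes the characters: in JavaScript, \W is shorthand for [^A-Za-z0-9_], so the final replace() treats every Japanese character as a non-word character.

```javascript
// Step-by-step trace of the chain used for urlHash, on a title with no ASCII letters.
const title = '次期大型アップデート';

// 1. No whitespace or ASCII hyphens here, so nothing changes.
//    (The "ー" in the title is U+30FC, a katakana length mark, not "-".)
const step1 = title.replace(/(\s|-)/g, '_');

// 2. NFKD decomposes some characters (e.g. デ becomes テ plus a combining mark),
//    but the result is still entirely non-ASCII.
const step2 = step1.normalize('NFKD');

// 3. \W means [^A-Za-z0-9_], so every remaining character is removed.
const step3 = step2.replace(/\W/g, '').toLowerCase();

console.log(JSON.stringify(step3)); // ""
```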

Is there a way to avoid this, or do I need a separate code path for non-western languages? I would prefer not to use conditionals.

I don't think Japanese characters in the URL are a problem; only the symbols are.
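One possible direction, sketched here under assumptions rather than taken from the original code, is to swap \W for Unicode property escapes (ES2018, requires the u flag), which match letters and digits in any script. The helper name slugify is hypothetical, and the switch from NFKD to NFKC is a choice made for this sketch so that characters such as デ stay composed.

```javascript
// Hypothetical helper (not from the original code): builds a slug that keeps
// letters and digits from any script and drops punctuation and symbols.
function slugify(title) {
  return title
    .normalize('NFKC')                     // fold compatibility forms, keep characters composed
    .replace(/[\s-]/g, '_')                // whitespace and hyphens become underscores
    .replace(/[^\p{L}\p{M}\p{N}_]/gu, '')  // drop anything that is not a letter, mark, digit, or "_"
    .toLowerCase();
}

console.log(slugify('「Apex Legends」次期大型アップデート“デファイアンス”のローンチトレイラーが公開に'));
// apex_legends次期大型アップデートデファイアンスのローンチトレイラーが公開に
```

The brackets and quotation marks are removed because they are punctuation, while "ー" survives because Unicode classifies it as a letter. No language check is needed, since the same expression handles western and non-western titles.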

Your previous question was flagged as a duplicate of one that suggested using the built-in URI encoding functions (`encodeURIComponent()` and `encodeURI()`). Is there a reason you can't use those? — DBS, Jan 28 '22 at 10:06
Does this help? https://stackoverflow.com/questions/6787716/regular-expression-for-japanese-characters — evolutionxbox, Jan 28 '22 at 10:08
Yes, it helps; I can generate a unique string using decoding. — Minide, Jan 28 '22 at 12:48
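For reference, the built-in encoding route mentioned in the comments might look roughly like the sketch below. The sample title is only illustrative. The encoded form is pure ASCII and reversible, at the cost of being much longer and unreadable in logs or databases.

```javascript
const title = '次期大型アップデート';

// encodeURIComponent percent-encodes the UTF-8 bytes of every character
// outside A-Z a-z 0-9 - _ . ! ~ * ' ( )
const encoded = encodeURIComponent(title);
console.log(encoded);

// decodeURIComponent restores the original string exactly.
console.log(decodeURIComponent(encoded) === title); // true
```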
