I previously asked a question about how to use a regex to sanitize a string that will be used in a URL. The code I am using in my API is this:
const newPost = new Post({
  category: req.body.category,
  user: req.body.user,
  title: req.body.title,
  content: req.body.content,
  shadowBanned: req.body.shadowBanned,
  community: req.body.community,
  categoryMeta: req.body.categoryMeta,
  image: req.body.imageUrl,
  externalLink: req.body.link,
  urlHash: req.body.title
    .replace(/(\s|-)/g, '_')
    .normalize('NFKD')
    .replace(/\W/g, '')
    .toLowerCase(),
  postUrl:
    req.body.community +
    '/c/' +
    req.body.category +
    '/p/' +
    req.body.title
      .replace(/(\s|-)/g, '_')
      .normalize('NFKD')
      .replace(/\W/g, '')
      .toLowerCase(),
  source: req.body.link
    ? new URL(req.body.link).hostname.replace('www.', '')
    : '',
});
The regex on urlHash and postUrl works well with Western languages: it outputs strings formatted like esp/c/CategoryName/ThisIsMyTitleWithoutSymbolsAndSpaces (title case used here only for readability). But when the title contains Japanese or Korean characters, the result is empty or keeps only a few characters. Here's an example:
const text = '「Apex Legends」次期大型アップデート“デファイアンス”のローンチトレイラーが公開に'
// will output apex_legends
Any non-Western character is simply wiped by the replace() call, because \W matches everything outside [A-Za-z0-9_].
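To make the behavior concrete, here is a minimal reproduction of the chain used for urlHash. The helper name `slugify` is hypothetical and exists only for this illustration; the chained calls are the same as in the code above.

```javascript
// Hypothetical helper wrapping the same chain used for urlHash/postUrl.
function slugify(title) {
  return title
    .replace(/(\s|-)/g, '_') // whitespace and hyphens become underscores
    .normalize('NFKD')       // split accented letters into base letter + combining mark
    .replace(/\W/g, '')      // drop everything outside [A-Za-z0-9_]
    .toLowerCase();
}

// Accented Latin letters survive because NFKD leaves an ASCII base letter behind.
console.log(slugify('Café Olé')); // cafe_ole

// Kana and kanji have no ASCII base letter, so nothing is left.
console.log(slugify('次期大型アップデート')); // (empty string)
```

So the accent handling works only because the decomposed base letters happen to be ASCII; scripts without an ASCII base lose every character.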
Is there a way to avoid this, or do I need a separate branch whenever the language is a non-Western one? I would prefer not to use conditionals.
I don't think Japanese characters in the URL are a problem in themselves; only the symbols are.
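One possible direction that avoids a per-language conditional is a Unicode-aware character class instead of \W. The following is only a sketch under stated assumptions: it assumes the runtime supports Unicode property escapes (`\p{L}`, `\p{M}`, `\p{N}`) with the `u` flag, and `slugifyUnicode` is a hypothetical name, not part of the code above. It also assumes that stripping only the combining marks in U+0300 to U+036F is acceptable, which removes Latin-style accents but leaves Japanese voicing marks alone.

```javascript
// Hypothetical Unicode-aware variant of the slug chain (a sketch, not a drop-in fix).
function slugifyUnicode(title) {
  return title
    .replace(/[\s-]+/g, '_')                 // runs of whitespace/hyphens -> one underscore
    .normalize('NFKD')                       // decompose so accents become separate marks
    .replace(/[\u0300-\u036f]/g, '')         // assumption: drop only these combining diacritics
    .normalize('NFC')                        // recompose kana voicing marks and Hangul syllables
    .replace(/[^\p{L}\p{M}\p{N}_]/gu, '')    // keep letters, marks, digits, underscore
    .toLowerCase();
}

const text = '「Apex Legends」次期大型アップデート“デファイアンス”のローンチトレイラーが公開に';
console.log(slugifyUnicode(text));
console.log(slugifyUnicode('Café Olé')); // cafe_ole
```

With this approach the brackets and quotation marks are removed because they are punctuation, while kana, kanji and Hangul are kept because they fall under `\p{L}`. Note that the range U+0300 to U+036F also covers marks used in other scripts (for example the breve in Cyrillic й), so whether to strip it is a design choice rather than a given.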