I found a slug function that looks like this:

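function slug(string) {
    return string.toString().toLowerCase()
        .replace(/\s+/g, '-')       // whitespace runs -> single dash
        .replace(/[^\w\-]+/g, '')   // drop everything except [A-Za-z0-9_] and dashes
        .replace(/\-\-+/g, '-')     // collapse repeated dashes
        .replace(/^-+/, '')         // trim leading dashes
        .replace(/-+$/, '');        // trim trailing dashes
}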

However, it does not work for Russian, Greek, and other non-Latin characters: they are removed by the .replace(/[^\w\-]+/g, '') step. I want to keep those letters, while still removing special characters that are not ordinary letters in any alphabet.
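
To illustrate why that step drops Cyrillic text, here is a small sketch. The helper name stripNonWord is made up for this demonstration; it isolates the one replace call from the function above. In this regex, \w only covers the ASCII range [A-Za-z0-9_], so every non-ASCII letter is treated as a non-word character.

```javascript
// Hypothetical helper: the single step from slug() that removes "non-word" characters.
// \w here is equivalent to [A-Za-z0-9_], so non-ASCII letters are not matched by it.
const stripNonWord = (s) => s.replace(/[^\w\-]+/g, '');

console.log(stripNonWord('do-you-know')); // 'do-you-know' (ASCII letters survive)
console.log(stripNonWord('ты-знаешь'));   // '-' (only the dash survives)
```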

Example:

English | Do you know it rains? | do-you-know-it-rains

Czech | víš, že prší? | vis-ze-prsi

Romanian | Ști că plouă? | sti-ca-ploua

Russian | ты знаешь, что идет дождь? | ты-знаешь-что-идет-дождь

Note:

For the Latin alphabet I want to keep the letters but remove the diacritics. For non-Latin alphabets I want to keep the letters as they are (I don't want to convert them into Latin characters).

paulalexandru
  • See also: https://stackoverflow.com/questions/13309620/convert-javascript-utf-8-to-ascii-like-iconvutf-8-ascii-translit-strin – cmbuckley Feb 18 '19 at 10:12

1 Answer

Here is an approach that works for special characters. Using an array of objects, you group every special character you want to replace under the Latin character that will replace it.

However, to leave Greek and Russian untouched, the final cleanup needs a regex that treats Greek and Cyrillic letters as word characters. So after replacing the special characters with the mapping above, remove everything else with the negated character class [^-a-zа-я\u0370-\u03ff\u1f00-\u1fff].

This character class allows the dash, the Latin letters a-z, the Cyrillic letters а-я, and the ranges \u0370-\u03ff (the Greek and Coptic block) and \u1f00-\u1fff (the Greek Extended block).
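
As a quick check of what that character class lets through, here is a minimal sketch. The names notAllowed and keep are only for this illustration. It also shows one limitation worth knowing: the range а-я (U+0430 to U+044F) does not contain ё (U+0451), so that letter would have to be added to the class or mapped beforehand.

```javascript
// Hypothetical names for illustration; the regex is the one described above.
const notAllowed = /[^-a-zа-я\u0370-\u03ff\u1f00-\u1fff]+/g;
const keep = (s) => s.replace(notAllowed, '');

console.log(keep('дождь-rain-τι!?')); // 'дождь-rain-τι' (punctuation removed)
console.log(keep('\u0451лка'));       // 'лка' (ё lies outside the а-я range)
```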

You can use the Wikipedia language recognition chart to add more special characters to the set.

function slugify(text) {
  text = text.toString().toLowerCase().trim();

  const sets = [
    {to: 'a', from: '[ÀÁÂÃÄÅÆĀĂĄẠẢẤẦẨẪẬẮẰẲẴẶἀ]'},
    {to: 'c', from: '[ÇĆĈČ]'},
    {to: 'd', from: '[ÐĎĐÞ]'},
    {to: 'e', from: '[ÈÉÊËĒĔĖĘĚẸẺẼẾỀỂỄỆ]'},
    {to: 'g', from: '[ĜĞĢǴ]'},
    {to: 'h', from: '[ĤḦ]'},
    {to: 'i', from: '[ÌÍÎÏĨĪĮİỈỊ]'},
    {to: 'j', from: '[Ĵ]'},
    {to: 'ij', from: '[\u0132]'}, // Ĳ ligature; a plain [IJ] would rewrite every i and j
    {to: 'k', from: '[Ķ]'},
    {to: 'l', from: '[ĹĻĽŁ]'},
    {to: 'm', from: '[Ḿ]'},
    {to: 'n', from: '[ÑŃŅŇ]'},
    {to: 'o', from: '[ÒÓÔÕÖØŌŎŐỌỎỐỒỔỖỘỚỜỞỠỢǪǬƠ]'},
    {to: 'oe', from: '[Œ]'},
    {to: 'p', from: '[ṕ]'},
    {to: 'r', from: '[ŔŖŘ]'},
    {to: 's', from: '[ßŚŜŞŠȘ]'},
    {to: 't', from: '[ŢŤ]'},
    {to: 'u', from: '[ÙÚÛÜŨŪŬŮŰŲỤỦỨỪỬỮỰƯ]'},
    {to: 'w', from: '[ẂŴẀẄ]'},
    {to: 'x', from: '[ẍ]'},
    {to: 'y', from: '[ÝŶŸỲỴỶỸ]'},
    {to: 'z', from: '[ŹŻŽ]'},
    {to: '-', from: '[·/_,:;\']'}
  ];

  sets.forEach(set => {
    text = text.replace(new RegExp(set.from,'gi'), set.to)
  });

  return text
    .replace(/\s+/g, '-')    // Replace spaces with -
    .replace(/[^-a-zа-я\u0370-\u03ff\u1f00-\u1fff]+/g, '') // Remove all non-word chars
    .replace(/--+/g, '-')    // Replace multiple - with single -
    .replace(/^-+/, '')      // Trim - from start of text
    .replace(/-+$/, '')      // Trim - from end of text
}

console.log(slugify('Do you know it rains?'));
console.log(slugify('víš, že prší?'));
console.log(slugify('Ști că plouă?'));
console.log(slugify('ты знаешь, что идет дождь?'));
console.log(slugify('ἀεὶ Λιβύη φέρει τι καινόν'));
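
As a possible alternative to the hand-written table (not what the code above does), here is a minimal sketch built on String.prototype.normalize and Unicode property escapes, which assumes an ES2018+ engine. The function name slugifyUnicode is made up for this example. It decomposes the text, drops combining marks only when they follow a Latin letter, then recomposes, so Cyrillic й or accented Greek letters are left intact. Letters that have no decomposition, such as ł, ø, ß or æ, pass through unchanged, so a mapping table like the one above would still be needed for them.

```javascript
// Hypothetical alternative sketch; assumes an engine with \p{...} support (ES2018+).
function slugifyUnicode(text) {
  return text
    .toString()
    .normalize('NFD')                            // split letters from their combining marks
    .replace(/(\p{Script=Latin})\p{M}+/gu, '$1') // drop marks only after a Latin letter
    .normalize('NFC')                            // recompose the rest (e.g. й, accented Greek)
    .toLowerCase()
    .replace(/[^\p{L}\p{M}\p{N}]+/gu, '-')       // runs of anything else become one dash
    .replace(/^-+|-+$/g, '');                    // trim dashes at both ends
}

console.log(slugifyUnicode('víš, že prší?'));              // 'vis-ze-prsi'
console.log(slugifyUnicode('Ști că plouă?'));              // 'sti-ca-ploua'
console.log(slugifyUnicode('ты знаешь, что идет дождь?')); // 'ты-знаешь-что-идет-дождь'
```

Unlike the fixed ranges in the regex above, \p{L} keeps letters from every script, which may or may not be what you want in a URL.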
jo_va
  • Hello, thank you for your work and answer. Maybe I wasn't clear enough. For Russian I want to keep the original letters in the URL and remove diacritics only from Latin letters. So I would like to convert ă into a, but keep Б as Б (a pretty URL, so that Russian readers understand it better). – paulalexandru Feb 18 '19 at 10:28
  • Ok I see, then the first snippet above will still work, you just have to remove the russian from the list – jo_va Feb 18 '19 at 10:30
  • No, actually it does not work, just check. The non-latin words are stripped off. – paulalexandru Feb 18 '19 at 10:36
  • Hello, thank you for the help. I will check your answer right now and come back with a response. So basically this is the thing that does the trick right ? [^а-яα-ωa-z-]+ basically it takes into consideration greek, russian and latin chars right ? – paulalexandru Feb 18 '19 at 10:53
  • Yes, a range for latin characters, one for greek and one for russian. I added an example for greek letters too, and it strips the accents. – jo_va Feb 18 '19 at 10:54
  • "ἀεὶ Λιβύη φέρει τι καινόν" this one is converted to "ε-λιβη-φρει-τι-καινν" , I'm not sure if that is ok. – paulalexandru Feb 18 '19 at 14:24
  • @paulalexandru, it is now working. I updated the Greek character range to its extended Unicode range. – jo_va Feb 18 '19 at 15:00