I am trying to split a string into an array of single words in JavaScript. The first step was quite easy:

words = text.split(/\b\s+(?!$)/);

This solution works fine, except that it doesn't treat punctuation characters as separators. For example, for "Hello! How are you?" the resulting array of words contains "Hello!", "How", "are", "you?".

I solved this problem with a not very elegant solution (but it works!):

str = str.replace(",", "");
str = str.replace(".", "");
str = str.replace("!", "");
str = str.replace("?", "");

But there is still a big problem. If str contains any non-English character (such as the Italian characters ò, à, è, ù), the split method doesn't separate the words.

For example, if text is "Perché sei partito?", "Perché sei" ends up as a single element of the words array (as if it were a single word).

Any solution? Thanks a lot for helping!

3 Answers

By using a regular expression that matches ASCII word characters as well as any non-ASCII character, you can build your array of words. However, instead of using split, which cuts the string at each match, you can use match, which returns the matches themselves as your array of words.

var wordsRegex = /([^\x00-\x7F]|\w)+/g;
var sentence = 'Hello! How are you?';
console.log(sentence.match(wordsRegex));  //=> ['Hello', 'How', 'are', 'you']

sentence = 'Perché sei partito?';
console.log(sentence.match(wordsRegex));  //=> ['Perché', 'sei', 'partito']

One thing you'll need to be aware of, though, is that the regex only excludes ASCII punctuation, so if your string includes non-ASCII punctuation (such as ¡), it will show up in the results.

sentence = 'Perché sei partito¡';
console.log(sentence.match(wordsRegex));  //=> ['Perché', 'sei', 'partito¡']

If you need to exclude non-English punctuation, you can add to the regex any Unicode characters you want to exclude. Fair warning, though: if you try to exclude every possible punctuation character, English and non-English, you'll end up with a fairly large regex. You might want to exclude only the most common ones and treat the rest as "good enough." For example, it is probably not worth excluding the ˥ symbol, as it is unlikely to appear in an ordinary sentence.
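
As a rough sketch of what adding exclusions could look like, the variant below lists the inverted marks ¡ (U+00A1) and ¿ (U+00BF) inside the negated class so they stop counting as word characters. The name wordsRegexNoInverted and the choice of those two characters are assumptions made for illustration, not part of the regex above.

```javascript
// Hypothetical variant of wordsRegex: the negated class also lists
// U+00A1 (¡) and U+00BF (¿), so those two marks no longer match.
const wordsRegexNoInverted = /(?:[^\x00-\x7F\u00A1\u00BF]|\w)+/g;

const spanishWords = '¡Hola! ¿Qué tal?'.match(wordsRegexNoInverted);
console.log(spanishWords);  //=> ['Hola', 'Qué', 'tal']

const italianWords = 'Perché sei partito¡'.match(wordsRegexNoInverted);
console.log(italianWords);  //=> ['Perché', 'sei', 'partito']
```

Any further characters to exclude would be appended to the same negated class in the same way.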

Steven Lambert

For a more elegant way of removing punctuation, see this question: How can I strip all punctuation from a string in JavaScript using regex?

To solve your issue of accented characters, consider using the following regex:

(?=\w|\W)\s+

Note that this one also picks up empty newlines. If you combine it with the top solution from the question I linked, though, the following should suffice to solve your problem:

(?=\w|\W)\s
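
Put together, the two steps might look like the sketch below: strip punctuation first, then split on whitespace with the regex above. The punctuation list and the splitWords helper are assumptions made for illustration; the linked question covers a fuller set of characters.

```javascript
// Assumption: a short, hand-picked punctuation list; extend it as needed.
const punctuation = /[.,!?;:]/g;

// Hypothetical helper: remove punctuation, then split on whitespace.
function splitWords(text) {
  return text
    .replace(punctuation, '')   // the g flag removes every occurrence
    .trim()                     // avoid empty strings at the edges
    .split(/(?=\w|\W)\s+/);     // split on runs of whitespace
}

console.log(splitWords('Perché sei partito?'));  //=> ['Perché', 'sei', 'partito']
console.log(splitWords('Hello! How are you?'));  //=> ['Hello', 'How', 'are', 'you']
```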
ryanlutgen

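Another solution using the String.match function:

var str = "Perché sei partito?",
    // No trailing \b here: JavaScript's \b only knows ASCII word characters,
    // so it would cut "Perché" off at the "é".
    words = str.match(/[a-zA-Z\u00C0-\u1FFF\u2C00-\uD7FF]+/g);

console.log(words);   // ["Perché", "sei", "partito"]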
RomanPerekhrest