I'm trying to use a script to calculate keyword density. Everything works except for non-English letters (Swedish, Estonian, or anything else).

$file contains the text.

Here's where the problem comes in:

$testsource = explode(" ", $file); // This has no problems with non-English letters

FIRST WORD in array: "Mängi"

$source = preg_split("/[(\b\W+\b)]/", $file, 0, PREG_SPLIT_NO_EMPTY); // This sometimes drops the non-English letter along with the letter in front of it

FIRST WORD in array: "ngi"

For this specific word the problem seems to be the "ä" character (and, in other words, other non-English characters): my current preg_split removes the "Mä" from the beginning of the word. Words with no special characters are fine.

Question: What can I change in the preg_split call so that it doesn't cause this issue?

mediacurse

1 Answer

Never mind, I solved it by dropping \W from the preg_split pattern and listing the separator characters explicitly:

$source = preg_split("/[(\b\+\b)\s!@#$%*]/",  $file, 0, PREG_SPLIT_NO_EMPTY); 
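
A possible alternative, sketched here on the assumption that $file is UTF-8 encoded: rather than listing separators by hand, split on anything that is not a Unicode letter or digit and add the /u modifier, so that a multi-byte character such as "ä" is treated as one character instead of two separate bytes. The helper name split_words and the sample text are made up for illustration and are not part of the original script.

```php
<?php
// Hypothetical helper (not from the original post): splits UTF-8 text into words.
// Assumption: the input is valid UTF-8; with /u, preg_split() returns false otherwise.
function split_words(string $text): array
{
    // \p{L} matches any Unicode letter, \p{N} any Unicode digit.
    // The /u modifier makes PCRE read pattern and subject as UTF-8.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Fall back to an empty list if the input could not be processed.
    return $words === false ? [] : $words;
}

// Sample text, made up for illustration.
$file = "Mängi tasuta mänge! Õun & šokolaad.";
$source = split_words($file);
print_r($source);
```

With this sample the first element of $source is "Mängi". Note that this sketch also splits on hyphens and apostrophes, which may or may not be what a keyword-density count wants.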
mediacurse