[please see below for answer]
I am using preg_match_all to extract hashtags from strings, for example:
#tree#ztdf #n4# night
contains hashtags: tree, ztdf, n4, night
Strings can be in any language and contain any characters, even emojis. Therefore, I enabled the UTF-8 modifier (/u) in my preg_match_all call:
preg_match_all('/#([\pL\p{Mn}]+)/u', $media_caption, $matches);
However, some characters are wrongly matched by their byte sequences:
I read here that this is a known problem with preg_match_all, UTF-8 encoding, and PHP. I also tried adding PCRE's additional UTF-8 flag (*UTF8):
preg_match_all('(*UTF8)/#([\p{L}\p{Mn}]+)/u', $media_caption, $matches)
.. but then I get this error:
syntax error, unexpected 'Enabled' T-flag
Does anyone know how I can extract #hashtags containing any UTF-8 character using preg_match_all?
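To make the goal concrete, here is a minimal sketch of the kind of extraction I am after. The sample caption is made up, and adding \p{N} to the character class is my own assumption (without it, a tag like n4 would stop at the digit). It relies only on the /u modifier and reports the PCRE error code if the subject is not valid UTF-8:

```php
<?php
// Hypothetical sample caption; "\u{00ED}" is the letter í (needs PHP 7.0+).
$media_caption = "#tree#ztdf #n4 #presadebuend\u{00ED}a";

// \p{L} = letters, \p{Mn} = combining marks, \p{N} = digits (an assumption,
// added so that "n4" is captured whole). The /u modifier makes PCRE treat
// both the pattern and the subject as UTF-8.
$hashtags = [];
$count = preg_match_all('/#([\p{L}\p{Mn}\p{N}]+)/u', $media_caption, $matches);

if ($count === false) {
    // With /u, invalid UTF-8 in the subject makes the call fail.
    echo 'PCRE error code: ', preg_last_error(), PHP_EOL;
} else {
    $hashtags = $matches[1];
    echo implode(', ', $hashtags), PHP_EOL;
}
```

If this prints the tags correctly, the pattern itself is probably not what corrupts the characters.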
[Edit]
OK, another day, back to the problem. Yesterday I realized that the garbled characters I got after json_decode() came only from viewing the output in the Windows command line, which can't handle UTF-8. Today I ran the program in the Git Bash console, and:
- the input to preg_match_all looks fine in UTF-8.
- it still looks fine after this: str_replace(array("\r\n", "\r", "\n", ","), ";", $media_caption);
(replaces all line breaks and commas with semicolons)
- and after this: preg_replace('!\s+!u', ' ', $media_caption);
(collapses runs of whitespace into a single space)
- now the funny part: it even looks fine after this: preg_match_all('/#([\p{L}\p{Mn}]+)/u', $media_caption, $matches);
For example, var_dump shows the following string like this in Git Bash:
string(15) "presadebuendía"
.. but in the written csv/txt file it appears as: presadebuend㮡
whereas this string, Embalse de Buendía, is written to the file correctly.
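Since the console can be misleading, a byte-level check seems more trustworthy than eyeballing the output. The following is only a sketch using the sample value above (the variable names are made up, and it assumes the mbstring extension is loaded). It checks whether the string is valid UTF-8 in memory before anything is written to disk:

```php
<?php
// Sample value from above; "\u{00ED}" is í (needs PHP 7.0+).
$tag = "presadebuend\u{00ED}a";

$isUtf8  = mb_check_encoding($tag, 'UTF-8'); // true if the bytes are valid UTF-8
$bytes   = strlen($tag);                      // byte count, as var_dump reports it
$chars   = mb_strlen($tag, 'UTF-8');          // character count
$hexTail = bin2hex(substr($tag, -3));         // last three bytes as hex

var_dump($isUtf8, $bytes, $chars, $hexTail);
```

The string(15) from var_dump fits 14 characters with í stored as two bytes, which suggests the value is still valid UTF-8 at this point and the damage happens later, when the file is written or viewed.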
I am currently looking into parts of my code that may mess with character encoding during data processing. So far, I have tried:
header('Content-Encoding: UTF-8');
header('Content-type: text/csv; charset=UTF-8');
mb_internal_encoding("UTF-8");
and replacing fopen with this function:
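function utf8_fopen_read($fileName) {
    // Read the whole file and convert it from Windows-1250 to UTF-8.
    $fc = iconv('windows-1250', 'utf-8', file_get_contents($fileName));
    // Expose the converted content through an in-memory stream.
    $handle = fopen("php://memory", "rw");
    fwrite($handle, $fc);
    fseek($handle, 0);
    return $handle;
}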
.. but none of this solved the issue.
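One thing I noticed about the iconv call in that helper: it only makes sense if the input really is Windows-1250. A small sketch (hypothetical sample string, assuming the iconv extension is available) shows that running it on text that is already UTF-8 changes the bytes instead of leaving them alone:

```php
<?php
// Already UTF-8: í is stored as the two bytes c3 ad ("\u{00ED}" needs PHP 7.0+).
$utf8 = "Buend\u{00ED}a";

// Treating those bytes as Windows-1250 converts each byte separately,
// so the single character í comes out as two different characters.
$converted = iconv('windows-1250', 'utf-8', $utf8);

echo bin2hex($utf8), PHP_EOL;
echo bin2hex($converted), PHP_EOL;
```

So if the data is UTF-8 from the start, this conversion step may add to the garbling rather than fix it.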