[please see below for answer]

I am using preg_match_all to extract hashtags from strings, for example:

#tree#ztdf #n4# night

contains hashtags: tree, ztdf, n4, night

Strings can be in any language and contain any characters, even emojis. Therefore, I enabled the UTF-8 flag (/u) in my preg_match_all:

preg_match_all('/#([\pL\p{Mn}]+)/u', $media_caption,  $matches);
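For illustration, here is a minimal sketch of how such a call behaves on the sample string. Note two assumptions that are not in the pattern above: `\p{N}` is added so that digits (as in `n4`) are captured, and `\s*` is added so that the space in `# night` is tolerated. The sample caption itself is hypothetical, and the script file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical caption built from the examples in this question.
$media_caption = '#tree#ztdf #n4# night #LujánDeCuyo';

// Assumptions (not part of the original pattern):
//   \p{N} keeps digits such as the "4" in "n4",
//   \s*   allows optional whitespace between "#" and the tag.
// The /u modifier makes PCRE treat pattern and subject as UTF-8.
preg_match_all('/#\s*([\p{L}\p{Mn}\p{N}]+)/u', $media_caption, $matches);

// $matches[1] holds the captured tag names without the leading "#".
$hashtags = $matches[1];
print_r($hashtags);
```

With the original pattern (letters and combining marks only), `#n4` would yield just `n`, and `# night` would not match at all.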

However, some characters are wrongly matched by their byte sequences:

[image: screenshot of the garbled matches, not preserved]

I read here that this is a problem with preg_match_all, UTF-8 encoding and PHP. I also tried adding the additional PCRE UTF-8 flag (*UTF8):

preg_match_all('(*UTF8)/#([\p{L}\p{Mn}]+)/u', $media_caption,  $matches)

... but then I get this error:

syntax error, unexpected 'Enabled' T-flag

Does anyone know how I can extract #hashtags containing any UTF-8 character using preg_match_all?

[Edit]

OK, another day, back to the problem. I realized yesterday that the garbled characters I got after json_decode() only appeared when looking at the output in the Windows command line, which can't handle UTF-8. Today I ran the program in the Git Bash console, and:

  • the input to preg_match_all looks fine in UTF-8;
  • there are no problems after str_replace(array("\r\n", "\r", "\n", ","), ";", $media_caption); (replace all line breaks and commas);
  • there are no problems after preg_replace('!\s+!u', ' ', $media_caption); (collapse multiple whitespace characters into a single space);
  • now the funny part: it even looks fine after preg_match_all('/#([\p{L}\p{Mn}]+)/u', $media_caption, $matches);
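The processing steps listed in this edit can be sketched end to end as follows. The caption is a hypothetical sample, and the script file is assumed to be saved as UTF-8; the three calls are the ones quoted above.

```php
<?php
// Hypothetical caption with a Windows line break and repeated spaces.
$media_caption = "Embalse de Buendía\r\n#presadebuendía   #tree";

// Step 1: replace line breaks and commas with semicolons.
$media_caption = str_replace(array("\r\n", "\r", "\n", ","), ";", $media_caption);

// Step 2: collapse runs of whitespace into a single space (UTF-8 aware).
$media_caption = preg_replace('!\s+!u', ' ', $media_caption);

// Step 3: extract hashtags made of letters and combining marks.
preg_match_all('/#([\p{L}\p{Mn}]+)/u', $media_caption, $matches);

echo $media_caption, PHP_EOL;
print_r($matches[1]);
```

None of these steps alters the bytes of the accented characters, which is consistent with the output looking fine at this point.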

For example, var_dump for the following string is this in Git Bash:

 string(15) "presadebuendía"

... but in the written CSV/TXT file it is presadebuend㮡, while Embalse de Buendía is written to the file correctly.

I am currently looking into parts of my code that may mess with character encoding during data processing. So far, I have tried:

  • header('Content-Encoding: UTF-8');
  • header('Content-type: text/csv; charset=UTF-8');
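  • mb_internal_encoding("UTF-8");

and replacing fopen with this function:

// Reads a file as windows-1250, converts it to UTF-8, and returns an
// in-memory stream holding the converted contents.
function utf8_fopen_read($fileName) {
    $fc = iconv('windows-1250', 'utf-8', file_get_contents($fileName));
    $handle = fopen("php://memory", "rw");
    fwrite($handle, $fc);
    fseek($handle, 0); // rewind so the caller reads from the start
    return $handle;
}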

.. but none of this solved the issue.

Alex
  • Have you tried using utf-16 or utf-32? Although utf-8 usually works – MirzaS Aug 21 '17 at 12:17
  • once visit this [link](https://bugs.php.net/bug.php?id=37391). – Rahul Aug 21 '17 at 12:20
  • Not a direct answer, but you could try turning it around: Match everything except word boundaries and hash characters: `/#([^\b#]+)/u` – jeroen Aug 21 '17 at 12:21
  • Strangely, none of your suggestions works: for example this is the correct hashtag as taken from Input #LujánDeCuyo and this is what I got extracted luj㢮decuyo – Alex Aug 21 '17 at 12:39
  • @jeroen: `\b` inside a character class means backspace, not word boundary. – Toto Aug 22 '17 at 13:26
  • @Toto Really? Didn't know that! What is a word boundary in a character class then? – jeroen Aug 22 '17 at 13:28
  • @jeroen: There're no possibilities because word boundary is not a character so it can't be included in a character class. – Toto Aug 22 '17 at 13:32
  • @Toto Hmmmmm, that kind of makes sense. Too bad... – jeroen Aug 22 '17 at 14:13

1 Answer

Thank you very much, everyone, for commenting. I apologize for pointing in the wrong direction: preg_match_all and the other regex functions were not what was messing with the characters. A couple of things confused me (such as the Windows command line not being able to output UTF-8). In the end, there was only one issue in my code:

  • before writing strings to file, I used the strtolower function. It works byte by byte (and, in older PHP versions, according to the current locale), so it can change individual bytes of a multi-byte UTF-8 character such as í (U+00ED) and corrupt it. The solution was to use mb_strtolower instead, which is multi-byte aware.
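A minimal sketch of the fix, assuming the mbstring extension is loaded and the script file is saved as UTF-8. The variable names are hypothetical; the sample tag is the one from the comments above.

```php
<?php
// Hypothetical tag, as extracted by preg_match_all.
$tag = 'LujánDeCuyo';

// mb_strtolower decodes the string as UTF-8 before changing case,
// so the two bytes of "á" are treated as one character and stay intact.
$lower = mb_strtolower($tag, 'UTF-8');

echo $lower, PHP_EOL;

// Sanity check: the result is still valid UTF-8.
var_dump(mb_check_encoding($lower, 'UTF-8'));
```

Passing the encoding explicitly avoids depending on mb_internal_encoding() being set elsewhere in the script.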

Of course, you couldn't spot this problem because I didn't include the specific code part in my question! While searching for the problem, I also added

  • header('Content-Encoding: UTF-8');
  • header('Content-type: text/csv; charset=UTF-8');
  • mb_internal_encoding("UTF-8");

to my php-script file, but this doesn't seem to have any effect on my output file. Anyway, solved my problem. Thank you!

Alex