
I'm prepping a cURL response so that particular data can be inserted into a MySQL table.

I noticed some special characters in the saved data for certain URLs.

$curldata = curl_exec($curl);
$encoding = mb_detect_encoding($curldata);

This brought back ASCII encoding.

Okay, don't want that.

The tables in my database are an InnoDB type with a utf8mb4_unicode_ci collation.

Added this to my curl options:

curl_setopt($curl, CURLOPT_ENCODING, 1);

I also added an iconv call upon save, based on the $encoding variable from mb_detect_encoding above.

$curldata = iconv($encoding, "UTF-8", $curldata);

// save to file to test output
file_put_contents('test.html', $curldata);

I'm not sure if this is the best way to go about this, but my test.html output no longer has any encoding problems with special characters, so... (perhaps) mission accomplished.
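As a minimal sketch of an alternative to guessing with mb_detect_encoding: the charset a server declares can be read from the Content-Type response header, assuming the server sends one. The helper name charset_from_content_type is hypothetical, and the curl_getinfo call is left as a comment because it needs a live handle.

```php
<?php
// Hypothetical helper: pull the declared charset out of a Content-Type value.
function charset_from_content_type(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null; // no charset declared
}

// With a real handle this value would come from:
// $contentType = curl_getinfo($curl, CURLINFO_CONTENT_TYPE);
$contentType = 'text/html; charset=utf-8';

$charset = charset_from_content_type($contentType);

// Only convert when a different charset is actually declared.
// if ($charset !== null && $charset !== 'UTF-8') {
//     $curldata = mb_convert_encoding($curldata, 'UTF-8', $charset);
// }
var_dump($charset);
```

If the declared charset is already UTF-8, no conversion step is needed before saving.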

As I parse through the data, I then notice this character.

Not an ordinary comma... [Comparison: ，/,]

But it acts like one. Try doing a ctrl+f to find a comma: it treats them as the same, and both as UTF-8 characters - var_dump(mb_detect_encoding('，'));

I look at my table and see a row inserted as such:

8，8

If I search for a , it does indeed bring back the instances where ， is present.

Vice versa, if I search for ， it brings back all instances where either that or a comma occurs.

Basically for all intents and purposes it is a comma, yet obviously isn't.

This is of course workable, but rather annoying and feels riddled with inconsistency.

Can anyone explain why the two commas are the same, yet obviously different?

Is there a way to prevent these odd characters from entering my cURL response, or further along in my DOM handling and PDO insert?

edit:

If relevant,

// dom
$dom = new DOMDocument('1.0', 'utf-8');
libxml_use_internal_errors(true);
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML(mb_convert_encoding($curldata, 'HTML-ENTITIES', 'UTF-8'));

// pdo
$pdoquery = "INSERT INTO `table` (`Attr`) VALUES (?)";
$value = "8，8";
// reuse the query string instead of repeating it
$stmt = $pdo->prepare($pdoquery);
$stmt->execute([$value]);

edit 2:

Well, it appears to be a FULLWIDTH COMMA (U+FF0C).

var_dump(utf8_to_unicode('，'));

string '%uff0c' (length=6)

var_dump(utf8_to_unicode(','));

string '%2c' (length=3)

Starting to make more sense... now to figure out how to prevent such characters from entering the curl response/DOM/database...
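The utf8_to_unicode helper above is not a PHP built-in (it comes from the linked answer). As a rough sketch of the same check using only mbstring, assuming PHP 7.2+ for mb_ord, with describe_char as a hypothetical name:

```php
<?php
// Hypothetical helper: show a character's code point and its UTF-8 bytes.
function describe_char(string $char): string
{
    return sprintf(
        'U+%04X (UTF-8 bytes: %s)',
        mb_ord($char, 'UTF-8'),
        strtoupper(bin2hex($char))
    );
}

echo describe_char("\u{FF0C}"), PHP_EOL; // U+FF0C (UTF-8 bytes: EFBC8C)
echo describe_char(','), PHP_EOL;        // U+002C (UTF-8 bytes: 2C)
```

Writing the character as "\u{FF0C}" keeps the source unambiguous even when an editor or font renders both commas alike.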

Brian Bruman
  • I agree, but that is seemingly only how it appears. The browser, MySQL, PHP, so forth, all claim it to be the same character (a simple comma) [image link](https://i.imgur.com/E9V6acr.png) – Brian Bruman Mar 07 '19 at 04:07
  • https://www.online-toolz.com/tools/text-unicode-entities-convertor.php says `，,` is `%uFF0C%2C`, i.e. `%uFF0C` and a simple comma. – Pinke Helga Mar 07 '19 at 04:12
  • Yeah you got me thinking the same. Made a second edit... based off this answer: https://stackoverflow.com/questions/395832/how-to-get-code-point-number-for-a-given-character-in-a-utf-8-string ... confusing how the browser interprets it as the same character though. Now trying to figure out how to prevent non-standard UTF-8 characters in my responses – Brian Bruman Mar 07 '19 at 04:16
  • The curl requested site is also controlled by yourself? Or do you just try to replace some characters from a foreign response? – Pinke Helga Mar 07 '19 at 04:22
  • No, it's an external site. Not exactly sure the ideal solution but I'm looking for maybe a function that will 'normalize' the string to standard utf-8 characters before my PDO/MySQL insert... hm – Brian Bruman Mar 07 '19 at 04:25
  • Isn't there any curl response header providing the actual encoding? Or a `<meta>` tag? – Pinke Helga Mar 07 '19 at 04:29
  • `` on the page and `Content-Type: text/html; charset=utf-8` in the headers. – Brian Bruman Mar 07 '19 at 04:37
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/189568/discussion-between-quasimodos-clone-and-brian-bruman). – Pinke Helga Mar 07 '19 at 04:39

2 Answers


You might want the function mb_convert_kana, which can convert characters of different widths into a uniform width.

$s = 'This is a string with ，, (commas having different widths)';

echo 'original : ', $s, PHP_EOL;
echo 'converted: ', mb_convert_kana($s, 'a');

result:

original : This is a string with ，, (commas having different widths)
converted: This is a string with ,, (commas having different widths)

PHP documentation: mb_convert_kana
To get an idea of what this means, see also http://unicode.org/reports/tr11-2/

By convention, 1/2 Em wide characters of East Asian legacy encodings are called "half-width" (or hankaku characters in Japanese), the others are called correspondingly "full-width" (or zenkaku) characters.
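Applied to the question's insert, this could look roughly like the sketch below. normalize_width is a hypothetical wrapper, the encoding is passed explicitly so the result does not depend on the internal encoding setting, and the extra 's' flag (full-width space to ordinary space) is an assumption about what else is worth folding.

```php
<?php
// Hypothetical wrapper: fold full-width ASCII variants to their ordinary forms.
function normalize_width(string $s): string
{
    // 'a': full-width alphanumerics and ASCII punctuation -> half-width
    // 's': full-width space (U+3000) -> ordinary space (U+0020)
    return mb_convert_kana($s, 'as', 'UTF-8');
}

$value = normalize_width("8\u{FF0C}8");
var_dump($value === '8,8'); // bool(true)

// The normalized value can then be bound as before:
// $stmt->execute([$value]);
```

Note that this also rewrites full-width letters and digits that were intentional in the source page, so it only fits when that loss is acceptable.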

Pinke Helga

With a suitable COLLATION, the two commas are treated as equal:

mysql> SELECT '，' = ',' COLLATE utf8mb4_general_ci;
+----------------------------------------+
| '，' = ',' COLLATE utf8mb4_general_ci  |
+----------------------------------------+
|                                      0 |
+----------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT '，' = ',' COLLATE utf8mb4_unicode_ci;
+----------------------------------------+
| '，' = ',' COLLATE utf8mb4_unicode_ci  |
+----------------------------------------+
|                                      1 |
+----------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT '，' = ',' COLLATE utf8mb4_unicode_520_ci;
+--------------------------------------------+
| '，' = ',' COLLATE utf8mb4_unicode_520_ci  |
+--------------------------------------------+
|                                          1 |
+--------------------------------------------+
1 row in set (0.00 sec)

It would be better to talk in terms of HEX, not unicode:

mysql> SELECT HEX('，'), HEX(',');
+------------+----------+
| HEX('，')  | HEX(',') |
+------------+----------+
| EFBC8C     | 2C       |
+------------+----------+
1 row in set (0.00 sec)
Rick James