I have a project that will receive data in any possible language. Right now I'm trying to parse a wiki page to get the list of languages and put it into a DB. Already at the parsing step I found that most of the native names are displayed as empty squares and other strange symbols. The declared charset is UTF-8.

I am not sure how this works and have no idea where to dig further. I couldn't find any information about multi-language content on websites. Do I need some kind of code table of all the symbols in order to use them? How do I make this work?

I need to:

  • Add each language's English name, native name and short code to the DB;
  • Display that data correctly in any country (the encoding issue);
  • Let people add data in a selected language; that data will also be saved in the database with a link to the language entry in the table described above.

Right now I have some encoding problems, so some text is displayed incorrectly, as in the image below. Here is what I have so far (only one row of the wiki table is included):

header('Content-Type: text/html; charset=utf-8');

$html = '<table class="wikitable sortable jquery-tablesorter" id="Table">
<tbody>
<tr>
<td style="background-color:#ACE1AF;width:#ACE1AF;"></td>
<td><a href="/wiki/Northwest_Caucasian_languages" title="Northwest Caucasian languages">Northwest Caucasian</a></td>
<td><a href="/wiki/Abkhazian_language" class="mw-redirect" title="Abkhazian language">Abkhazian</a></td>
<td lang="ab" xml:lang="ab">аҧсуа бызшәа, аҧсшәа</td>
<td><span class="plainlinks"><a rel="nofollow" class="external text" href="http://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ab">ab</a></span></td>
<td>abk</td>
<td>abk</td>
<td>abk</td>
<td>also known as Abkhaz</td>
</tr>
</tbody><tfoot></tfoot></table>';

$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
    $cols = $row->getElementsByTagName('td');
    echo $cols->item(2)->nodeValue.' ';
    echo $cols->item(3)->nodeValue.' ';
    echo $cols->item(4)->nodeValue.'<br>';
    echo '<hr>';
}

The output looks like this: [image]

But if I echo $html directly, everything is shown correctly. I use the latest version of Google Chrome. I need some clues and tips about how this works and how I can make my code work properly.

Thanks for your attention.

Telion

2 Answers

Change the collation of the database, tables and columns to utf8mb4_unicode_520_ci. Also keep in mind that with utf8mb4 the maximum length of a uniquely indexed VARCHAR column is 191 characters (4 bytes per character under the older 767-byte index key limit).

As far as I know, phpMyAdmin sets the collation to latin1_swedish_ci by default, but this collation isn't recommended for multilingual websites. UTF-8 was made for exactly this purpose.

The ci suffix at the end of the collation name means case-insensitive.
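As a rough sketch of the advice above, the connection charset and the table definition could look like this. The helper names (utf8mb4Dsn, languagesTableSql), the table name and the column names are all hypothetical, chosen only for illustration; the code just builds the strings so it runs without a database, and the actual PDO calls are left as comments.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: builds a PDO DSN that requests a utf8mb4 connection.
function utf8mb4Dsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

// Hypothetical schema for the language list; names are assumptions.
// VARCHAR(191) keeps indexed columns within the 767-byte key limit.
function languagesTableSql(): string
{
    return <<<SQL
CREATE TABLE languages (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    english_name VARCHAR(191) NOT NULL,
    native_name VARCHAR(191) NOT NULL,
    iso_code VARCHAR(8) NOT NULL UNIQUE
) DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci
SQL;
}

echo utf8mb4Dsn('localhost', 'langs'), "\n";
echo languagesTableSql(), "\n";

// With a running MySQL server one would then do something like:
// $pdo = new PDO(utf8mb4Dsn('localhost', 'langs'), $user, $password);
// $pdo->exec(languagesTableSql());
```

Setting the charset in the DSN matters as much as the collation: if the connection itself is not utf8mb4, correctly stored text can still come back garbled.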

Axon
  • Thanks for the reply. But I have this problem on the PHP side. When I "echo" the content to HTML there are squares. Is there any way to make it show correctly on any computer? Saving data to the database is my next step of suffering :) – Telion Jul 31 '17 at 18:30
  • @Telion If possible, please post the code that contains the problem and say which browser you use to preview it, assuming the collation is already set to `UTF8`. – Axon Jul 31 '17 at 18:31
  • Not a problem. Give me some time. – Telion Jul 31 '17 at 18:43
  • Ok, there it is. – Telion Jul 31 '17 at 18:54
  • (Actually, my phpMyAdmin uses `utf8_general_ci` as the default collation, so everything worked by default.) – Telion Aug 04 '17 at 14:09

I think DOMDocument does not handle characters outside Latin-1 correctly here: when the HTML carries no charset declaration, loadHTML() treats the input as ISO-8859-1.
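To illustrate what such a misreading does, here is a minimal sketch (assuming the mbstring extension is available) that reinterprets the UTF-8 bytes of a sample string as ISO-8859-1. The sample string is taken from the table row in the question; the variable names are only for illustration.

```php
<?php
declare(strict_types=1);

$native = 'аҧсуа'; // 5 Cyrillic characters, 2 bytes each in UTF-8

// Treat every UTF-8 byte as one ISO-8859-1 character and re-encode it,
// which is roughly what a parser assuming Latin-1 input ends up doing.
$misread = mb_convert_encoding($native, 'UTF-8', 'ISO-8859-1');

echo mb_strlen($native, 'UTF-8'), "\n";  // 5
echo mb_strlen($misread, 'UTF-8'), "\n"; // 10
```

Each original character turns into two unrelated ones, which the browser then renders as squares or accented Latin letters.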

Change the line $dom->loadHTML($html); to:

$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

This should help.

More info in the related answer
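As far as I know, the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated since PHP 8.2, so here is a sketch of a commonly used alternative: prepend a charset declaration so that loadHTML() decodes the input as UTF-8. The function name tableCells and the shortened sample row are assumptions made for this example, not part of the original code.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: returns the text of every <td> in a UTF-8 HTML fragment.
function tableCells(string $html): array
{
    $previous = libxml_use_internal_errors(true); // silence warnings on fragments
    $dom = new DOMDocument();
    // The meta tag tells the HTML parser that the bytes are UTF-8.
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $cells = [];
    foreach ($dom->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->nodeValue);
    }
    return $cells;
}

$html = '<table><tr><td>Abkhazian</td><td lang="ab">аҧсуа бызшәа, аҧсшәа</td><td>ab</td></tr></table>';
echo implode(' | ', tableCells($html)), "\n";
```

The values DOMDocument returns are UTF-8 strings, so they can be echoed directly on a page served with charset=utf-8.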

Artur Babyuk