I have a project that will receive data in any possible language. Right now I'm trying to parse a wiki page, extract the list of languages, and put it into a DB. Already at the parsing step I found that most of the native names are displayed as empty squares and other strange symbols. The declared charset is UTF-8.
I am not sure how this works and have no idea where to dig further. I couldn't find any information about handling multi-language content on websites. Do I need some kind of code table for all the symbols in order to use them? How do I make this work?
What I need:
- Store each language's English name, native name, and short code in the DB;
- Display that data correctly in any country (the encoding issue);
- Let people add data in a selected language; it will also be saved in the database with a link to the language name from the table described above.
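To make the requirements above concrete, here is a rough sketch of the two tables I have in mind. The table and column names (`languages`, `entries`, `name_en`, `name_native`, `code`, `language_id`, `body`) are placeholders I made up, and in-memory SQLite through PDO is only a stand-in so the snippet runs anywhere; it is not my real database.

```php
<?php
// Sketch only: in-memory SQLite stands in for the real database (assumption).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('PRAGMA foreign_keys = ON');

// Hypothetical table holding one row per language.
$pdo->exec('CREATE TABLE languages (
    id INTEGER PRIMARY KEY,
    name_en TEXT NOT NULL,
    name_native TEXT NOT NULL,
    code TEXT NOT NULL UNIQUE
)');

// Hypothetical table for user-submitted data, linked to a language.
$pdo->exec('CREATE TABLE entries (
    id INTEGER PRIMARY KEY,
    language_id INTEGER NOT NULL REFERENCES languages(id),
    body TEXT NOT NULL
)');

// Store the English name, the native name, and the short code.
$insert = $pdo->prepare(
    'INSERT INTO languages (name_en, name_native, code) VALUES (?, ?, ?)'
);
$insert->execute(['Abkhazian', 'аҧсуа бызшәа, аҧсшәа', 'ab']);
$languageId = (int) $pdo->lastInsertId();

// Attach a piece of user data to that language.
$pdo->prepare('INSERT INTO entries (language_id, body) VALUES (?, ?)')
    ->execute([$languageId, 'Sample text added by a user']);

// Read the native name back to check that it survives the round trip.
$stored = $pdo->query("SELECT name_native FROM languages WHERE code = 'ab'")
    ->fetchColumn();
echo $stored, PHP_EOL;
```

With MySQL, I assume the tables and the connection would both need the `utf8mb4` character set for the native names to be stored intact.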
Right now I have some encoding problems, so some text is displayed incorrectly, as in the image below. Here is what I have so far (only one row of the wiki table is included):
header('Content-Type: text/html; charset=utf-8');
$html = '<table class="wikitable sortable jquery-tablesorter" id="Table">
<tbody>
<tr>
<td style="background-color:#ACE1AF;width:#ACE1AF;"></td>
<td><a href="/wiki/Northwest_Caucasian_languages" title="Northwest Caucasian languages">Northwest Caucasian</a></td>
<td><a href="/wiki/Abkhazian_language" class="mw-redirect" title="Abkhazian language">Abkhazian</a></td>
<td lang="ab" xml:lang="ab">аҧсуа бызшәа, аҧсшәа</td>
<td><span class="plainlinks"><a rel="nofollow" class="external text" href="http://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ab">ab</a></span></td>
<td>abk</td>
<td>abk</td>
<td>abk</td>
<td>also known as Abkhaz</td>
</tr>
</tbody><tfoot></tfoot></table>';
$dom = new DOMDocument();
// Parser options are set before the markup is loaded.
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);

$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row) {
    $cols = $row->getElementsByTagName('td');
    // Columns 2 to 4: English name, native name, two-letter code.
    echo $cols->item(2)->nodeValue . ' ';
    echo $cols->item(3)->nodeValue . ' ';
    echo $cols->item(4)->nodeValue . '<br>';
    echo '<hr>';
}
However, if I echo $html directly, everything is displayed correctly. I am using the latest version of Google Chrome. I would appreciate any clues and tips on how this works and how I can get my code to work properly.
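Since the raw string prints fine, the damage seems to happen inside the DOM parsing step. One possible explanation, which is an assumption on my part and not a confirmed diagnosis: `DOMDocument::loadHTML()` treats the input as ISO-8859-1 when the markup itself does not declare a charset. A minimal sketch to test that idea prepends a `<meta>` charset declaration before loading; the helper name `loadUtf8Html` is hypothetical, and the table is cut down to three cells.

```php
<?php
// Hypothetical helper (not part of my original code): parse a UTF-8 HTML
// fragment with DOMDocument after declaring its charset explicitly.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The meta declaration tells the HTML parser the bytes are UTF-8.
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

// Reduced version of the wiki row: English name, native name, short code.
$html = '<table><tr><td>Abkhazian</td>'
      . '<td lang="ab">аҧсуа бызшәа, аҧсшәа</td><td>ab</td></tr></table>';

$cols = loadUtf8Html($html)->getElementsByTagName('td');
$nativeName = $cols->item(1)->nodeValue;
echo $nativeName, PHP_EOL;
```

If the assumption holds, `nodeValue` should come back as the same UTF-8 text that was passed in, and the `header()` call then only has to tell the browser about the charset.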
Thanks for your attention.