I am extracting URLs and titles from a post's content, but the titles no longer seem to be UTF-8: when I echo the result they include stray characters such as "Â". Any idea why the correct charset isn't being used? My headers do declare the right charset.

I tried some of the solutions on here, but none seems to work so I thought I'd add my code below - just in case I'm missing something.

    $servername = "localhost";
    $database = "xxxx";
    $username = "xxxxx";
    $password = "xxxx";
    $conn = mysqli_connect($servername, $username, $password, $database);

    $post_id = 228;

    $content_post = get_post($post_id);
    $content = $content_post->post_content;
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $content);

    $links = $doc->getElementsByTagName('a');

    $counter = 0;
    foreach ($links as $link) {
        $href = $link->getAttribute('href');
        $avoid = array('.jpg', '.png', '.gif', '.jpeg');

        // Skip links whose href contains an image extension.
        if ($href == str_replace($avoid, '', $href)) {
            $title = $link->nodeValue;
            $title = html_entity_decode($title, ENT_NOQUOTES, 'UTF-8');

            $sql = "INSERT INTO wp_urls_download (title, url) VALUES ('$title', '$href')";
            if (mysqli_query($conn, $sql)) {
                $counter++;
                echo "Entry" . $counter . ": $title" . "<br>";
            } else {
                echo "Error: " . $sql . "<br>" . mysqli_error($conn);
            }
        }
    }

Update: I changed the echo string after I initially posted the code. I have already tried the solutions in the other posts without success.

Remco
  • Because you're not setting your *database connection encoding*?! – deceze Aug 20 '18 at 11:24
  • Hmm, not really. I see what you are getting at, but I am just echoing the `$title` value on the screen, so the database connection does not get involved (yet) – Remco Aug 20 '18 at 11:40
  • You are echoing what where exactly? What encoding is the content in? – deceze Aug 20 '18 at 11:43
  • ah my bad, I updated my code after posting this. I have now added the updated echo code where it just echos the `$title`. I have also added `$title = html_entity_decode($title, ENT_NOQUOTES, 'UTF-8');` but no success. the original content is in utf-8. – Remco Aug 20 '18 at 11:48
  • Show `bin2hex($title)` and what you expect the title to look like. – deceze Aug 20 '18 at 11:48
  • Lots of lines, but this is one with the funky character: `5472616e737665727365204162646f6d696e6973c382c2a02854564129` – Remco Aug 20 '18 at 11:55
  • This is the current title `Transverse AbdominisÂ (TVA)` but it should be `Transverse Abdominis (TVA)` – Remco Aug 20 '18 at 11:56
  • You will have to trace that back a bit more to find the source of that byte sequence. See https://stackoverflow.com/a/25502632/476. – deceze Aug 20 '18 at 11:59
  • OK, fair enough. In the meantime, I'll use `$titles = str_replace("Â","",$title);`. Thanks for your help. – Remco Aug 20 '18 at 12:07
  • I can't answer my own question as it's closed - but this worked for me `$title = utf8_decode($title);` – Remco Aug 20 '18 at 13:08
  • Do not use any form of encode/decode; 2 wrongs may appear to make a right, but really they add to the mess. – Rick James Aug 23 '18 at 18:34
  • @RickJames yeah that's true - I'll use your answer and go through it again. Thanks. – Remco Aug 23 '18 at 18:36

2 Answers

Did you try to set the utf8 charset on the connection?

$conn->set_charset('utf8');

For more information: http://php.net/manual/en/mysqli.set-charset.php
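
A minimal sketch of how that call might fit into the connection setup from the question. The helper name `connectWithCharset` is hypothetical, and defaulting to `utf8mb4` is an assumption (this answer uses `utf8`); pass whichever charset matches your tables.

```php
<?php
// Hypothetical helper: open a mysqli connection and set the client
// character set before any query is sent.
function connectWithCharset(
    string $host,
    string $user,
    string $pass,
    string $db,
    string $charset = 'utf8mb4' // assumption; the answer above uses 'utf8'
): mysqli {
    // Depending on the PHP version and mysqli error mode, a failed
    // connection either returns false or throws mysqli_sql_exception.
    $conn = mysqli_connect($host, $user, $pass, $db);
    if ($conn === false) {
        throw new RuntimeException('Connection failed: ' . mysqli_connect_error());
    }

    // set_charset() returns false if the charset could not be applied.
    if (!$conn->set_charset($charset)) {
        throw new RuntimeException('Could not set charset: ' . $conn->error);
    }

    return $conn;
}
```

With the variables from the question this would be called as `connectWithCharset($servername, $username, $password, $database)`, and `$conn->character_set_name()` can then be used to confirm which charset the connection reports.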

Michael Tijhuis
  • This didn't work for me on the load content, but setting it on the connection is a good one. Didn't think of that. Thanks. – Remco Aug 23 '18 at 18:35
  • Let me know if it works. I had the same situation before, and changing the connection encoding worked well for me. – Michael Tijhuis Aug 23 '18 at 21:24

It seems that you have "double-encoding". What you expected was

Transverse Abdominis (TVA)

But what you have for the space before the parenthesis is a special space that probably came from Microsoft Word, then got converted to utf8 twice. In hex: A0 -> c2a0 -> c382c2a0.

Yes, the link to "utf8 all the way through" would ultimately provide the fix, but I think you need more help.

That is, the A0 byte was converted from latin1 to utf8, and then the resulting bytes were treated as if they were still latin1 and converted to utf8 a second time.
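
The byte sequence can be reproduced with a short sketch. The helper name `latin1ToUtf8` is hypothetical, and using `iconv` here is only an assumption for illustration; it is not necessarily the function that caused the damage in your code.

```php
<?php
// Hypothetical helper: treat the input bytes as Latin-1 and convert them to UTF-8.
function latin1ToUtf8(string $bytes): string
{
    return iconv('ISO-8859-1', 'UTF-8', $bytes);
}

$nbsp  = "\xA0";               // no-break space as a single Latin-1 byte
$once  = latin1ToUtf8($nbsp);  // correct UTF-8 encoding of U+00A0
$twice = latin1ToUtf8($once);  // UTF-8 bytes misread as Latin-1 and converted again

// prints "a0 -> c2a0 -> c382c2a0"
echo bin2hex($nbsp), ' -> ', bin2hex($once), ' -> ', bin2hex($twice), PHP_EOL;

// The same two steps applied to the whole title, starting from Latin-1 bytes.
$title = "Transverse Abdominis\xA0(TVA)";
echo bin2hex(latin1ToUtf8(latin1ToUtf8($title))), PHP_EOL;
```

The second line of output is the same hex string as the `bin2hex($title)` result posted in the comments on the question, which is why this looks like double-encoding rather than a single bad character.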

The connection should declare the client's encoding via `mysqli_obj->set_charset('utf8')` (or similar).

Then the column in the table should be CHARACTER SET utf8mb4 (or utf8). Verify with SHOW CREATE TABLE. (It is probably latin1 currently.)

HTML should start with <meta charset=UTF-8>.

See also: Trouble with UTF-8 characters; what I see is not what I stored

Rick James
  • Ah, that makes sense. I managed to get it to work while my question was locked, but it is good to understand what happened and why. Thanks. – Remco Aug 23 '18 at 18:34
  • @Remco - I hope you did not fix it with `str_replace`; that will fix only the one case; other cases may show up with different messes. – Rick James Aug 23 '18 at 18:36
  • Oh no. I must say that I was tempted, but I realised that it would have been a risky and messy "fix". `$doc->loadHTML(''. $content);` did the trick for me. – Remco Aug 23 '18 at 18:39