UTF-8 strings won't convert similarly. I want all the scraped text to end up in the same encoding before saving it to the database

Question

I have huge problems with encodings. I'm scraping text from some other sites with file_get_contents(), and the quotes become odd special characters or question marks. The strange thing is that texts from different sites ARE UTF-8, yet the quotes turn into different things when I receive them. When I run utf8_decode(), a quote from one UTF-8 text becomes a quote, but in another UTF-8 text from another site it becomes a question mark.

Is there any way to fix this so that all the text looks right when I save it to the database?

The collation of the database table is latin1_swedish_ci. I have tried changing it to utf8_unicode_ci, but it made no difference.

Edit:

I have now tried a bit more. These two calls each work for different texts. This one works for one text:

$source = utf8_encode($source);

And this one works for the others:

$source = mb_convert_encoding($source, 'HTML-ENTITIES', 'utf-8');

But you can't put the string through both. They don't work together: each one breaks the texts that the other one fixes.
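One way to avoid running a string through both conversions is to convert only the text that is not already valid UTF-8. The sketch below is an illustration, not something from the original post: `to_utf8()` is a hypothetical helper, and it assumes that any page that is not valid UTF-8 is Windows-1252, a guess that will not hold for every site.

```php
<?php
// Hypothetical helper (not from the original post): bring a scraped string
// to UTF-8 exactly once, so it is never double-encoded.
function to_utf8(string $text): string
{
    // Already valid UTF-8: return it untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Assumption: everything else is Windows-1252, the usual source of
    // curly quotes on pages that are loosely labelled as Latin-1.
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}
```

With such a helper, `$source = to_utf8($source);` could stand in for both calls above, and every string would reach the database as UTF-8.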

Printscreen without any encoding (text is in Swedish):

Edit:

FYI: I have now changed the table to utf8_unicode_ci. However, it is still not working. Here are all the functions I've tried:

Actually, if I just leave it like this, most of the texts are output with the right characters. It's just in some of them that " becomes Â”.
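The Â” symptom is consistent with a curly quote stored as the Windows-1252 byte 0x94 being converted as if it were ISO-8859-1, which is the conversion utf8_encode() performs. The following sketch shows the byte-level difference. The sample byte is an assumption about what the scraped pages contain, not data taken from the post.

```php
<?php
// Assumed input: a right curly quote as Windows-1252 encodes it.
$quote = "\x94";

// The same conversion utf8_encode() performs: read the input as ISO-8859-1.
$asLatin1 = mb_convert_encoding($quote, 'UTF-8', 'ISO-8859-1');

// Read the input as Windows-1252 instead.
$asCp1252 = mb_convert_encoding($quote, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c294
echo bin2hex($asCp1252), "\n"; // e2809d
```

The bytes c2 94 show up as Â” when they are displayed as Windows-1252, while e2 80 9d is the UTF-8 encoding of the curly quote itself.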

This sounds similar, take a look: http://stackoverflow.com/questions/910793/detect-encoding-and-make-everything-utf-8 Maybe it helps you get the sources clean before you decode them. – swidmann Sep 18 '15 at 14:35

score 0 · Answer 1 · answered Sep 18 '15 at 19:44

Can you please dump the code you grabbed using print_r()?

Note: your HTML page must declare the correct meta charset to display Unicode characters correctly.

<head>
    <meta charset="UTF-8">
</head>

answered Sep 18 '15 at 19:44

GrafiCode

Does the page need the right meta charset even for text fields like ``? That's where I'm printing them. I've now added a screenshot to my post above. Everything is in Swedish. And yes, the content type is UTF-8. – Peter Westerlund Sep 18 '15 at 19:53
Well, yes: wherever you decide to output your data (even inside the attribute value of an input box), the meta charset should always be declared so that it matches the DB collation. – GrafiCode Sep 18 '15 at 20:00
I just want all the text going into DB to be the same type... Don't know what to do :/ – Peter Westerlund Sep 18 '15 at 20:05
