I'm trying to use GuzzleHTTP 6 for web scraping, and so far I haven't been able to find a solution for the garbled encoding in the response body.

Let's say I want to parse a web page that returns data in many different languages.

Client Initialization

public function __construct() {
    $this->dataClient = new Client(['base_uri' => 'http://somewebsite.org/{language_code}']);
}

Using Data Client

$request = $this->dataClient->get('/endpoint/' . $data_query . '/');
$response = $request->getBody()->__toString();
$decoded = json_decode($response, true);
foreach ($decoded as $index => $data) {
    $decoded[$index] = str_replace(['<option', '>', '</option>'], '', $data);
}
return $decoded;

Problems:

  1. If the text is in English, the response looks almost fine, except that some of the characters are garbled:

    manipulation, thereâ;€™s

Instead of

manipulation, there's
  2. If I try to get data in any other language, this is what I get (data in Russian):

    Ð;Ð;°; Ð;¿;Ð;µ;Ñ;€Ð;²;Ñ;‹Ð;¹; Ð;²;Ð

Instead of

На первый взгляд

The problem is that if you view the website in a browser, everything looks fine, but if you try to scrape it, you run into these problems. So far I haven't been able to find the source of the problem; neither utf8_decode nor iconv solves it.
Any solutions are highly welcome!

Update: this is the parsing function:

public function processData($data_query) {
    $request = $this->dataClient->get('/endpoint/' . $data_query . '/');
    $response = $request->getBody()->__toString();
    // echo $response; - Everything is fine, no encoding problems
    // return $response; - Encoding problems
    $decoded = json_decode($response, true);
    // return $decoded; - Encoding problems
    foreach ($decoded as $index => $data) {
        $decoded[$index] = str_replace(['<option', '>', '</option>'], '', $data);
    }
    return $decoded; // Encoding problems
}
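
One way to narrow this down is to check whether the bytes in `$response` are valid UTF-8 before anything else touches them. The following is a minimal sketch, assuming the mbstring extension is available; `inspectBody` is a hypothetical helper name, not part of Guzzle. If the body passes this check here but is garbled later, the damage is probably happening after the function returns rather than in the HTTP client.

```php
<?php
// Hypothetical helper: reports whether a string is valid UTF-8
// and exposes its first raw bytes as hex for manual inspection.
function inspectBody(string $body): array
{
    return [
        'valid_utf8' => mb_check_encoding($body, 'UTF-8'),
        'hex'        => bin2hex(substr($body, 0, 16)),
    ];
}

// "\xD0\x9D" is the UTF-8 byte sequence for the Cyrillic letter "Н".
$report = inspectBody("\xD0\x9D");
var_dump($report['valid_utf8']); // bool(true)
echo $report['hex'], PHP_EOL;    // d09d
```

Note that `json_decode` itself rejects malformed UTF-8 (it returns `null` and `json_last_error()` reports `JSON_ERROR_UTF8`), so a successful decode is already a hint that the body arrived as valid UTF-8.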

Raw response headers

{
    Date: [
        "Wed, 08 Jun 2016 01:45:30 GMT"
    ],
    Server: [
        "Apache"
    ],
    X-Frame-Options: [
        "SAMEORIGIN"
    ],
    Retry-After: [
        "600"
    ],
    Content-Language: [
        "en-GB"
    ],
    Vary: [
        "Accept-Encoding"
    ],
    Transfer-Encoding: [
        "chunked"
    ],
    Content-Type: [
        "text/html;charset=UTF-8"
    ]
}
Ivan Zhivolupov
  • Can you show what the value is of the `Content-Type` header in the response? There should (hopefully) be a `charset` parameter. What does it say? You could potentially try to add the `Accept-Charset: UTF-8` header. That should force the server to encode the data in UTF-8 if properly implemented. – Brad Frost Jun 08 '16 at 01:26
  • { Date: [ "Wed, 08 Jun 2016 01:28:18 GMT" ], Server: [ "Apache" ], X-Frame-Options: [ "SAMEORIGIN" ], Retry-After: [ "600" ], Content-Language: [ "en-GB" ], Vary: [ "Accept-Encoding" ], Transfer-Encoding: [ "chunked" ], Content-Type: [ "text/html;charset=UTF-8" ] } – Ivan Zhivolupov Jun 08 '16 at 01:28
  • How exactly do you check the response? What is the encoding of the target you're printing the response to? – zerkms Jun 08 '16 at 01:30
  • Well, that's what I'm trying to do, actually. I'll update the main post with new info – Ivan Zhivolupov Jun 08 '16 at 01:33
  • So you mentioned below it appears to be fine if you echo the response. Do you mean that there are no messed up utf-8 characters? – Brad Frost Jun 08 '16 at 01:34
  • If the response is echoed, everything is fine; if it is decoded or just returned, it's garbled – Ivan Zhivolupov Jun 08 '16 at 01:35

2 Answers

I had a similar case (loading XML with Guzzle and parsing it with SimpleXML): I knew that the source was in ISO-8859-1, but the output of the SimpleXML parsed result was scrambled. I tried lots of approaches, and only this one solved it:

$attribute = mb_convert_encoding((string) $attribute, 'ISO-8859-1', 'UTF-8');

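The attribute is an XML node value. I simply convert from the known encoding to the one I want. Note that mb_convert_encoding takes the target encoding as its second argument and the source encoding as its third, so this call converts from UTF-8 to ISO-8859-1. Hope this helps someone.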
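
To illustrate why a conversion like this can undo scrambled text, here is a minimal sketch, assuming the mbstring extension is available. It simulates one common cause of this kind of garbling (UTF-8 bytes being read as ISO-8859-1 and re-encoded as UTF-8) and then reverses it. Whether this is the actual cause in the question is an assumption; the variable names are illustrative only.

```php
<?php
// UTF-8 bytes for the Cyrillic text "На", written as escapes
// so the encoding of this source file does not matter.
$original = "\xD0\x9D\xD0\xB0";

// Simulate the damage: treat every UTF-8 byte as one ISO-8859-1
// character and re-encode each of those characters as UTF-8.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Reverse it. Argument order is (string, to_encoding, from_encoding).
$repaired = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

var_dump($repaired === $original);             // bool(true)
var_dump(strlen($original), strlen($garbled)); // int(4) int(8)
```

The doubled byte length is a useful symptom: text that was double-encoded this way is noticeably longer in bytes than the original.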

Holzhey

Have a read of the older SO answer posted here: "Can Goutte/Guzzle be forced into UTF-8 mode?". Yes, it mentions using utf8_decode(), but in conjunction with a fork of Guzzle. Have a look at Guzzle's issue tracker: does it have an issue that sounds similar to yours? If so, comment on it to see whether the core devs will fix it. That SO post is 3 years old; I'd be surprised if hacks and forks were still needed if the issue were so prevalent.

Perhaps it has been fixed, and you now need to make sure that the crawled page itself sends the correct encoding headers. Note that there is an order of precedence for encoding declarations. I believe the charset in the web server's Content-Type header takes priority, and the declaration in the document itself (such as a meta tag) is used only if the server's response omits it. But please check this, as I'm not 100% sure.
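
As a rough illustration of checking what the server declares, the sketch below pulls the `charset` parameter out of a Content-Type header value. `charsetFromContentType` is a hypothetical helper written for this example, and the regular expression is a simplification rather than a full header parser.

```php
<?php
// Hypothetical helper: extracts the charset parameter from a
// Content-Type header value, or returns null if none is declared.
function charsetFromContentType(string $header): ?string
{
    // Matches e.g. ";charset=UTF-8" or '; charset="iso-8859-1"'.
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $header, $m) === 1) {
        return $m[1];
    }
    return null;
}

var_dump(charsetFromContentType('text/html;charset=UTF-8')); // string(5) "UTF-8"
var_dump(charsetFromContentType('text/html'));               // NULL
```

With Guzzle 6, the header value could be obtained from the response via the PSR-7 method `getHeaderLine('Content-Type')` and passed to a helper like this one.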

theruss
  • Thanks for the comment! I've already read that topic, and the thing is: 1. Since the project is kind of big, I'm not allowed to make any changes to the PHP HTTP client. 2. The workaround in that topic does not work with the current GuzzleHTTP version. 3. The remote server returns text/html;charset=UTF-8. 4. The local server is set to UTF-8 mode. 5. All requests are made using UTF-8 (for some reason, CP1251 also works in IE). 6. If you echo the response, it is fine, but if you try to work with it, it is garbled. – Ivan Zhivolupov Jun 08 '16 at 01:24
  • And if I try to 'guess' the encoding with PHP (or any other tool, to be honest), I get various encodings, but not UTF-8, even though I know it is supposed to be UTF-8. The latest guesses were CP850, CP851, MAC*something*, and others, so I'm really lost in this 'zoo' of encodings – Ivan Zhivolupov Jun 08 '16 at 01:31
  • OK, so it prints OK but you cannot "work with it". So there's an encoding-related difference between the manner in which you print/echo and "work with it". Can you give us some detail about how you echo and how you "work with it"? Sounds more like a receiving end problem at this stage. – theruss Jun 08 '16 at 01:39
  • I've updated the main post and added the complete function to the end of it, along with some comments – Ivan Zhivolupov Jun 08 '16 at 01:44