I'm trying to use Guzzle 6 for web scraping, and so far I haven't found a fix for the garbled encoding in the response body.
Say I want to parse a web page that returns data in many different languages.
Client Initialization
public function __construct() {
    $this->dataClient = new Client(['base_uri' => 'http://somewebsite.org/{language_code}']);
}
Using Data Client
$request = $this->dataClient->get('/endpoint/' . $data_query . '/');
$response = (string) $request->getBody();
$decoded = json_decode($response, true);
foreach ($decoded as $index => $data) {
    // Strip '</option>' first; once '>' is removed it can no longer match.
    $decoded[$index] = str_replace(['</option>', '<option', '>'], '', $data);
}
return $decoded;
Problems:
If the text is in English, the response looks almost fine, except that some characters are garbled:
manipulation, thereâ€™s
Instead of
manipulation, there's
If I try to get data in any other language, this is what I get (data in Russian):
ÐÐ° Ð¿ÐµÑ€Ð²Ñ‹Ð¹ Ð²Ð
Instead of
На первый взгляд
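For reference, the English garbling above matches what you get when UTF-8 bytes are read as Windows-1252 and then encoded as UTF-8 a second time. A minimal sketch that reproduces the symptom, assuming the mbstring extension is available; the sample string is made up for illustration and is not taken from the site:

```php
<?php
// Hypothetical sample: "there’s" with U+2019 written as its three UTF-8 bytes.
$original = "there\xE2\x80\x99s";

// Misread the UTF-8 bytes as Windows-1252 and convert them to UTF-8 again.
$garbled = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');
echo $garbled, PHP_EOL; // prints: thereâ€™s

// Reversing the conversion restores the original bytes for this sample.
$repaired = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');
echo $repaired, PHP_EOL; // prints: there’s
```

The reverse conversion is only a way to confirm the diagnosis; it is not guaranteed to be lossless for every byte, since a few byte values have no Windows-1252 character.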
The odd part is that the website looks fine when viewed in a browser; the problems only appear when scraping it.
So far I haven't found the source of the problem, and neither utf8_decode nor iconv fixes it.
Any solutions are highly welcome!
So, here is a small update. This is the parsing function:
public function processData($data_query) {
    // Despite the variable name, get() returns the response object.
    $request = $this->dataClient->get('/endpoint/' . $data_query . '/');
    $response = (string) $request->getBody();
    // echo $response; - Everything is fine, no encoding problems
    // return $response; - Encoding problems
    $decoded = json_decode($response, true);
    // return $decoded; - Encoding problems
    foreach ($decoded as $index => $data) {
        // Strip '</option>' first; once '>' is removed it can no longer match.
        $decoded[$index] = str_replace(['</option>', '<option', '>'], '', $data);
    }
    return $decoded; // Encoding problems
}
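Since echo looks fine but the returned value does not, one way to narrow this down is to look at the raw bytes instead of the rendered text. A rough sketch, assuming mbstring is available; inspectBody is a hypothetical helper for illustration, not part of Guzzle:

```php
<?php
// Hypothetical helper, not part of Guzzle: reports whether a string is
// valid UTF-8 and shows its first bytes as hex.
function inspectBody($body, $length = 16)
{
    return [
        'valid_utf8' => mb_check_encoding($body, 'UTF-8'),
        'hex_prefix' => bin2hex(substr($body, 0, $length)),
    ];
}

// "На" correctly encoded as UTF-8 is the four bytes d0 9d d0 b0.
print_r(inspectBody("\xD0\x9D\xD0\xB0"));
```

If the body inside processData starts with bytes like d09d for Russian text, the string itself is probably fine and the garbling happens later, when the returned value is rendered. If it starts with something like c390 instead, the text was already encoded twice before it got here.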
Raw response headers
{
Date: [
"Wed, 08 Jun 2016 01:45:30 GMT"
],
Server: [
"Apache"
],
X-Frame-Options: [
"SAMEORIGIN"
],
Retry-After: [
"600"
],
Content-Language: [
"en-GB"
],
Vary: [
"Accept-Encoding"
],
Transfer-Encoding: [
"chunked"
],
Content-Type: [
"text/html;charset=UTF-8"
]
}
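The Content-Type header above declares UTF-8. A small sketch for pulling the declared charset out of such a header value so it can be compared with what the bytes look like; charsetFromContentType is a hypothetical helper, not a Guzzle function:

```php
<?php
// Hypothetical helper, not part of Guzzle: extracts the charset parameter
// from a Content-Type header value, or returns null when none is declared.
function charsetFromContentType($contentType)
{
    if (preg_match('/charset\s*=\s*"?([^";\s]+)/i', $contentType, $matches)) {
        return strtoupper($matches[1]);
    }
    return null;
}

echo charsetFromContentType('text/html;charset=UTF-8'), PHP_EOL; // prints: UTF-8
```

With Guzzle 6 the header value itself should be available through the PSR-7 method getHeaderLine('Content-Type') on the response object.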