
I want to load an RTL page with UTF-8 character encoding using the Java 11 HttpClient.
This is my example code:

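import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.http.HttpClient;
import java.time.Duration;

public class HttpClientFactory {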

    public static final HttpClientFactory client = new HttpClientFactory();
    private CookieManager cookieManager;

    private HttpClientFactory() {
        initializeCookieManager();
    }

    private void initializeCookieManager() {
        cookieManager = new CookieManager();
        cookieManager.setCookiePolicy(CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(cookieManager);
    }

    public HttpClient produceHttpClient() {
        return HttpClient.newBuilder()
                .cookieHandler(cookieManager)
                .version(HttpClient.Version.HTTP_2)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }
}


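import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class TsetmcBrowser {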
    private static HttpClient client;
    public static TsetmcBrowser instance = new TsetmcBrowser();

    private TsetmcBrowser() {
        client = HttpClientFactory.client.produceHttpClient();
    }

    public void testConnection() {
        System.out.println("[*] Request Load Page");
        HttpRequest request = HttpRequest.newBuilder()
                .GET()
                .uri(URI.create("http://www.tsetmc.com/Loader.aspx"))
                .setHeader("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0")
                .setHeader("Accept-Language", "en-US,en;q=0.5")
                .setHeader("Accept", "text/plain")
                .build();
        try {

            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(new String(response.body().getBytes(), StandardCharsets.UTF_8));
            System.out.println("--------------------------------------------------------------------");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}      

But it returns a garbled response body.
Response:

�t^��m�%���o��;;���0�<�������~��o_Z������5��kA�ڮ��z0����8�h��*�;�r�N���       

How can I fix this problem?

UPDATE
I also checked this code with other RTL pages, such as:
https://www.farsnews.ir
https://sepehr.irib.ir

mah454
  • If you use `BodyHandlers.ofString`, you don't need to get the bytes. It will process the input as a `String` according to the encoding specified by the server. So you should just call `System.out.println(response.body());` – areus Jan 15 '21 at 07:49
  • Yes, I already tried that before you suggested it, but it does not work! – mah454 Jan 15 '21 at 07:50
  • Make sure the console you are running your program on supports UTF-8. That's not usually the case with Windows. The String may be ok but the console doesn't show it correctly – areus Jan 15 '21 at 07:54
  • It's likely that you're getting gzip-encoded responses and you're processing them as plain text. I saw when doing this that one needed to unzip the response in code (the response handlers don't take care of it out of the box). So start by analyzing your response type and encoding headers first – ernest_k Jan 15 '21 at 08:01
  • See this: https://stackoverflow.com/questions/53502626/does-java-http-client-handle-compression – mah454 Jan 15 '21 at 08:13
  • I just tested it and @ernest_k is right. The server is returning a gzipped response, ignoring the "Accept" header. Try `HttpResponse<InputStream> response = client.send(request, HttpResponse.BodyHandlers.ofInputStream()); GZIPInputStream gzipInputStream = new GZIPInputStream(response.body()); String content = new String(gzipInputStream.readAllBytes(), StandardCharsets.UTF_8); System.out.println(content);` – areus Jan 15 '21 at 08:20
  • Badly written server, ignoring both `Accept` and `Accept-Encoding` headers in the request. The response is always `Content-Type: text/html; charset=utf-8` and `Content-Encoding: gzip` – Andreas Jan 15 '21 at 08:45
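
The approach suggested in the comments (read the body as a stream and unzip it when the server sends `Content-Encoding: gzip`) could be sketched roughly as follows. This is only an illustration under assumptions: the class `GzipBodyDecoder` and its methods `decodeBody` and `gzip` are hypothetical names that do not appear in the question, and the demo compresses a string locally instead of contacting the real server.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; not part of the question's code.
final class GzipBodyDecoder {

    // Reads the body as UTF-8 text, unzipping it first when the
    // Content-Encoding value is "gzip" (compared case-insensitively).
    static String decodeBody(InputStream body, String contentEncoding) {
        try (InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(body)
                : body) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compresses text as gzip; used only to simulate a gzipped response offline.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // A short Persian word, written with escapes to avoid source-encoding issues.
        String original = "\u0633\u0644\u0627\u0645";
        byte[] zipped = gzip(original);
        String decoded = decodeBody(new ByteArrayInputStream(zipped), "gzip");
        // Reports whether the round trip preserved the text.
        System.out.println(decoded.equals(original));
    }
}
```

With the question's client, the same helper would presumably be called with a response obtained via `HttpResponse.BodyHandlers.ofInputStream()`, passing `response.body()` and `response.headers().firstValue("Content-Encoding").orElse("")`. Checking the header rather than always unzipping keeps the code working if the server ever returns an uncompressed body.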
