
I'm trying to fetch the content of a web page with the Spring Framework, using this method:

public <T>HttpReply<T> httpRequest(final String uri, final HttpMethod method,
            final Class<T> expectedReturnType, final List<HttpMessageConverter<?>> messageConverters,
            final HashMap<String, Object> formValues, final HashMap<String, Object> headers)
                    throws HttpNullUriOrMethodException, HttpInvocationException {
        try {

            redirectInfo.set(new AbstractMap.SimpleEntry<String, String>(uri, ""));

            if (method==null) {
                throw new HttpNullUriOrMethodException("HttpMethod cannot be null.");
            }

            if (!StringUtils.hasText(uri)) {
                throw new HttpNullUriOrMethodException("URI cannot be null or empty.");
            }

            HttpRequestExecutingMessageHandler handler =
                    buildMessageHandler(uri, method, expectedReturnType, messageConverters);

            // Default queue for reply
            QueueChannel replyChannel = new QueueChannel();
            handler.setOutputChannel(replyChannel);

            // Exec Http Request
            Message<?> message = buildMessage(formValues, headers);
            try {
                handler.handleMessage(message);
            }
            catch (Exception e) {
                throw new HttpInvocationException("Error Handling HTTP Message.");
            }

            // Get Response
            Message<?> response = replyChannel.receive();
            if (response == null) {
                throw new HttpInvocationException("Error: communication is interrupted.");
            }

            // Read response Headers
            String[] usefulHeaders = readUsefulHeaders(response.getHeaders());

            // Return payload
            Object respObj = response.getPayload();             

            if (expectedReturnType != null && !expectedReturnType.isInstance(respObj)) {
                throw new HttpInvocationException("Error: response payload is instance of "
                         + respObj.getClass().getName() + ". Expected: " + expectedReturnType.getName());
            }

            HttpReply<T> retVal = new HttpReply<>();
            retVal.setPayload((T)respObj);

            String valRedirect = uri;
            if (redirectInfo.get().getKey().equals(uri)) {
                if (StringUtils.hasText(redirectInfo.get().getValue())) {
                    valRedirect = redirectInfo.get().getValue();
                }
            }
            else {
                throw new HttpInvocationException("ERROR READING REDIRECT INFORMATION!!! Original URI: "
                        + uri + " - FOUND URI: " + redirectInfo.get().getKey());
            }
            retVal.setActualLocation(valRedirect);
            return retVal;
        }
        finally {
            redirectInfo.remove();
        }
    }

It is called like this:

HttpReply<byte[]> feedContent = httpUtil.httpRequest(rssFeed.getUrl(), HttpMethod.GET, byte[].class, null,
                null, null);

rawXml = new String(feedContent.getPayload());

This works fine, except that rawXml sometimes contains � (U+FFFD, the replacement character), especially when the page uses a charset other than UTF-8.
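The symptom can be reproduced in isolation. `new String(byte[])` decodes with the platform default charset, so bytes written in another encoding turn into � when that default is UTF-8. The following is a minimal sketch; the class and method names are hypothetical and not part of my project:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class CharsetMismatchDemo {

    /** Encodes text with one charset and decodes the bytes with another. */
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        // Naming the charset explicitly avoids the platform default
        // that new String(bytes) would use.
        return new String(bytes, decodeWith);
    }

    /** Returns true if the text contains the Unicode replacement character. */
    static boolean hasReplacementChar(String text) {
        return text.indexOf('\uFFFD') >= 0;
    }
}
```

Calling `roundTrip("perch\u00e9", StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)` yields a string for which `hasReplacementChar` is true, because the single byte 0xE9 is not a valid UTF-8 sequence on its own. Decoding with `StandardCharsets.ISO_8859_1` on both sides returns the original text.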

I tried calling handler.setCharset(StandardCharsets.ISO_8859_1) on the handler, and I also tried changing the message headers so that they contain "contentType=application/xml; charset=ISO-8859-1", but neither helped.

I also tried converting the text once it was already in rawXml, but sometimes the response is neither UTF-8 nor ISO-8859-1, so the conversion doesn't restore the missing characters.
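One direction that might apply here, assuming the byte[] payload still holds the original bytes from the server, is to read the encoding from the XML declaration instead of guessing it. The sketch below is only an illustration: `XmlCharsetSniffer`, `sniffCharset` and `decodeXml` are hypothetical names, the fallback to UTF-8 is an assumption, and UTF-16 or other non-ASCII-compatible encodings are not handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class XmlCharsetSniffer {

    // Matches the encoding pseudo-attribute of an XML declaration,
    // e.g. <?xml version="1.0" encoding="windows-1252"?>
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    /** Returns the charset named in the XML declaration, or UTF-8 if none is usable. */
    static Charset sniffCharset(byte[] xml) {
        // The declaration is ASCII in every ASCII-compatible encoding, so
        // ISO-8859-1 (one byte per char, never fails) is safe for this peek.
        int len = Math.min(xml.length, 256);
        String head = new String(xml, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the default.
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes the whole document with the sniffed charset. */
    static String decodeXml(byte[] xml) {
        return new String(xml, sniffCharset(xml));
    }
}
```

With this, `rawXml = XmlCharsetSniffer.decodeXml(feedContent.getPayload());` would replace the `new String(...)` call. If the bytes already contain the UTF-8 encoding of U+FFFD (EF BF BD), the damage happened before this point and no decoding step can undo it.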

Malignus
  • Welcome to the world of character sets. They are terribly difficult to get right. Hopefully the server tells you what character set it is using, and then you use that character set to read your XML. However, not all character sets play nicely with the expected character ranges. One that is notorious for this is the Microsoft character set that used control characters, which caused all sorts of problems. I would recommend looking for those control characters in your returned XML and replacing them with suitable characters. http://www.alanwood.net/demos/ansi.html – hooknc Dec 28 '22 at 16:47
  • The characters that get changed into � are the classic àòèéìù, “”, «» and SOMETIMES '. The problem is that the payload in retVal is a byte array that is already populated with the wrong characters. – Malignus Jan 02 '23 at 14:31
  • So, I am not totally sure what your question/goal is... Are you trying to convert the square question mark into the correct characters? Do you want to reject any XML that isn't UTF-8? Do you want to accept any character set and then convert it to UTF-8 as part of the response? My guess is that you want to convert the characters, but that can sometimes be difficult. First and foremost, I would urge you to figure out the `code points` of the characters that are messed up, then determine what those values should actually be. https://stackoverflow.com/q/23979676/42962 – hooknc Jan 03 '23 at 17:23
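
The code-point check suggested in the last comment could be sketched roughly as follows. `CodePointDump` and `describeNonAscii` are hypothetical names; the method only lists the non-ASCII code points of an already decoded string, so that U+FFFD or unexpected control characters stand out.

```java
// Hypothetical helper, for illustration only.
final class CodePointDump {

    /** Lists every non-ASCII code point in the text as "U+XXXX", one per line. */
    static String describeNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints()
            .filter(cp -> cp > 0x7F) // plain ASCII is never the problem here
            .forEach(cp -> sb.append(String.format("U+%04X", cp)).append('\n'));
        return sb.toString();
    }
}
```

Running `CodePointDump.describeNonAscii(rawXml)` on a broken feed would show whether the string holds U+FFFD (the original bytes were lost during decoding) or code points in the U+0080 to U+009F range (which can indicate Windows-1252 text decoded as ISO-8859-1).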
