I have an application that polls an external feed hourly and receives the feed JSON as a stream with chunked transfer encoding. The feed listener writes each chunk to a file; once the whole stream has completed, another thread parses the file and extracts the data. The problem is that the file now ends up containing binary data, even though I specify the charset when writing.

    public void writeToFile(InputStream in) {
        File feedFile = new File("/tmp/feed.json");
        try {
            FileUtils.touch(feedFile);
            // Decode the incoming bytes as UTF-8 into an in-memory buffer
            StringWriter writer = new StringWriter();
            IOUtils.copy(in, writer, StandardCharsets.UTF_8);
            // Append the decoded text to the feed file, encoded as UTF-8
            FileUtils.write(feedFile, writer.toString(), StandardCharsets.UTF_8, true);
        } catch (IOException e) {
            logger.error(Constants.FAILED_TO_WRITE_FEED_INTO_FILE, e);
        }
    }

This code works fine on Windows and on a Linux box, but inside a Docker container the file is written in binary format.

The Docker container uses CentOS 7.

Brajesh Pant
  • What do you mean by `binary format`? Are the umlauts and other non-ASCII characters just scrambled? – bratkartoffel Jun 04 '18 at 12:51
  • See if @PaulRey answer helps. If not then try using `ENV JAVA_TOOL_OPTIONS -Dfile.encoding=UTF8` in your `Dockerfile` and see if it changes anything – Tarun Lalwani Jun 04 '18 at 12:57
  • @bratkartoffel here is what i mean `[root@f9d5003f866d tmp]# file -i feed.json feed.json: application/octet-stream; charset=binary` you can see the charset is binary – Brajesh Pant Jun 04 '18 at 13:26
  • @TarunLalwani I have printed the Current Encoding `{ "Default Charset: ": "UTF-8", "Default Encoding:": "UTF8", "Default Locale: ": "en_US", "file.encoding; ": "UTF-8", "sun.jnu.encoding:": "UTF-8" }` – Brajesh Pant Jun 04 '18 at 13:27
  • @BrajeshPant have you actually looked in it? – bratkartoffel Jun 04 '18 at 13:29
  • Please don't trust file command inside docker. Copy the file outside the container and then check – Tarun Lalwani Jun 04 '18 at 13:31
  • @bratkartoffel I have tried the approaches suggested here and also searched the net for UTF-8 encoding issues, but I am not sure why the string encoding is getting changed – Brajesh Pant Jun 04 '18 at 13:31
  • @TarunLalwani I just copied it, and it's still showing the same: `file -i feed.json feed.json: application/octet-stream; charset=binary` – Brajesh Pant Jun 04 '18 at 13:33
  • @BrajeshPant: I think you haven't understood what I wanted. Did you try to `cat` the file or open it in a hex editor? What's actually wrong within the file? – bratkartoffel Jun 04 '18 at 13:44
  • @bratkartoffel `�i��+X���;�e")�\q:^m L0/�4�n���6�B�v�A�~���m�M���1/N�.ET���T ig8Gc�P���B��I�H�{��6��ӘN+�K_��ɂ�Z�H�Lc�';܃�v��3Q:�%i�ix�c�hR8�zl6����H�A(8<��Z�2P��Q&��j 12I���e��\���Ci@bnO�����#�>�ϫ��棧�Y�25 2<��v �`�͈�8kd�g�1�y����ͨ֊,C���iv:.�~�yk8��"�5E>;��5y�Qפ�����98��@zS�)���Q�▒��܀GRA�]�Y�;�WU�Ԁ��P���⃼h<�4,�y:�o�� ��'����` this is what i get in file after i do cat, this creates an issue while parsing this file via programatically – Brajesh Pant Jun 04 '18 at 13:49
  • Have you also tried to debug the application? I think that the problem is, that the data coming from the InputStream is already broken. See https://stackoverflow.com/questions/46662125/remote-debugging-java-9-in-a-docker-container-from-intellij-idea – bratkartoffel Jun 04 '18 at 13:52
  • please provide a minimal repo to debug and reproduce this issue – Tarun Lalwani Jun 04 '18 at 16:50
  • I suspect that you're retrieving a compressed feed. – teppic Jun 11 '18 at 04:36
  • Did you find a solution? – Paul Rey Aug 02 '18 at 08:09
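
Following teppic's suspicion above, one way to test for a compressed feed is to peek at the first two bytes of the stream: gzip data starts with the magic bytes `0x1f 0x8b`. The sketch below is only an illustration under that assumption; the class `FeedStreams` and its methods `maybeGunzip`, `readUtf8`, and `gzip` are hypothetical names, not part of the asker's application, and the thread does not confirm that the feed is actually gzipped.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; not part of the original application.
final class FeedStreams {

    /** Wraps the input in a GZIPInputStream if it starts with the gzip magic bytes 0x1f 0x8b. */
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);            // remember the start so the peeked bytes can be read again
        int b1 = in.read();
        int b2 = in.read();
        in.reset();
        boolean looksGzipped = (b1 == 0x1f && b2 == 0x8b);
        return looksGzipped ? new GZIPInputStream(in) : in;
    }

    /** Reads the whole stream and decodes it as UTF-8. */
    static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Demo-only helper: gzips a byte array to simulate a compressed feed. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "{\"feed\": \"ok\"}".getBytes(StandardCharsets.UTF_8);
        // Both the compressed and the uncompressed input should print the same JSON text.
        System.out.println(readUtf8(maybeGunzip(new ByteArrayInputStream(gzip(plain)))));
        System.out.println(readUtf8(maybeGunzip(new ByteArrayInputStream(plain))));
    }
}
```

If the check fires inside the container but not elsewhere, the difference probably lies in what the feed is sending (for example, a different response encoding) rather than in the charset used when writing the file.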

2 Answers


Maybe the UTF-8 locale doesn't exist in the container?

You can check the current locale in your running container with `cat /etc/locale.conf`.

If it's not `LANG=en_US.utf8`, you can follow the instructions from this Stack Overflow post by user2915097:

    # Set the locale
    RUN sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen && \
        locale-gen
    ENV LANG en_US.UTF-8
    ENV LANGUAGE en_US:en
    ENV LC_ALL en_US.UTF-8

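Source: How to set the locale inside a Ubuntu Docker container? https://stackoverflow.com/a/28406007/3756843

Note that this snippet was written for Ubuntu: `/etc/locale.gen` and `locale-gen` are Debian/Ubuntu tools, so a CentOS 7 image would need its own equivalent step.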

EDIT 1:

You should use InputStreamReader instead of reading the InputStream directly, because:

  • InputStream is made to handle binary data (raw bytes)
  • InputStreamReader is made to handle text (it decodes bytes into characters using a charset)

You can find more information here.
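
As a minimal sketch of the distinction above, the following reads a stream through an `InputStreamReader` with an explicit UTF-8 charset. The class `Utf8Decode` and its method `decode` are hypothetical names used only for illustration. To my knowledge, `IOUtils.copy(InputStream, Writer, Charset)` in the question already performs the same kind of decoding internally, so this is unlikely to change the outcome on its own; it mainly shows that decoding cannot repair input that is not UTF-8 text to begin with.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of the original application.
final class Utf8Decode {

    /** Decodes the whole stream as UTF-8 text via an InputStreamReader. */
    static String decode(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Valid UTF-8 input decodes cleanly ("h\u00e9llo" contains an e with acute accent).
        byte[] text = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(new ByteArrayInputStream(text)));

        // Bytes that are not valid UTF-8 (here, the gzip magic bytes) are replaced
        // with U+FFFD, which is what shows up as scrambled output in a terminal.
        byte[] notText = {(byte) 0x1f, (byte) 0x8b};
        System.out.println(decode(new ByteArrayInputStream(notText)).indexOf('\uFFFD') >= 0);
    }
}
```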

Paul Rey

You can try this in your Dockerfile:

    ENV LANG C.UTF-8
    ENV LC_ALL C.UTF-8

It follows the same idea as the other comments you got, but uses Docker's own mechanism.
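
To check whether such `ENV` settings actually reach the JVM, a small report like the one the asker posted in the comments can be printed at startup. This is a sketch only: `EncodingReport` is a hypothetical class name, and `sun.jnu.encoding` is a JVM-internal property that may be absent on some runtimes.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic class; not part of the original application.
final class EncodingReport {

    /** Collects the charset-related settings the JVM sees; missing values are reported as null. */
    static Map<String, String> report() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("Default Charset", Charset.defaultCharset().name());
        m.put("file.encoding", System.getProperty("file.encoding"));
        m.put("sun.jnu.encoding", System.getProperty("sun.jnu.encoding"));
        m.put("LANG", System.getenv("LANG"));
        m.put("LC_ALL", System.getenv("LC_ALL"));
        return m;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : report().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

If everything already reports UTF-8, as it did for the asker, the locale is probably not the cause, since the code in the question passes `StandardCharsets.UTF_8` explicitly when writing.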

Alim Özdemir