bash 'fold' screws up encoding in emacs

Question

I am reading lines from 'somefile' and writing them to a 'sample.org' file.

echo "$line" 1>>sample.org gives the correct result, which is 'Субъективная оценка (от 1 до 5): 4 - отличный, понятный и богатый вкусом ..' (Russian letters).

echo "$line" | fold -w 160 1>>sample.org gives garbled output in Emacs, although the text is technically correct if you copy and paste it anywhere outside Emacs. Still, why does using fold result in Emacs displaying the 'sample.org' buffer as 'RAW-TEXT' instead of 'UTF-8'?

To reproduce it, create two files in the same directory. The first is test.sh, which contains:

cat 'test.org' |
  while read -r line; do
    # echo "$line" 1>'newfile.org' # works fine
    # line below writes those weird chars to the output file
    echo "$line" | fold -w 160 1>'newfile.org'
  done

The second is test.org, which contains just 'Среднеферментированный среднепрожаренный улун полусферической скрутки. Содержание ГАМК 200мг/100г.'

Run the script with bash test.sh and you should see the problem in the output file newfile.org.

The image looks like the first line is raw, while the other two are in some unspecified 8-bit (?) encoding. I don't think we can tell what happened here without access to the input file and whatever else you can supply to provide us with a proper [mre]. — tripleee, Aug 24 '21 at 18:00
Thanks for the update. (That's a [useless `cat`](https://stackoverflow.com/questions/11710552/useless-use-of-cat), though.) — tripleee, Aug 25 '21 at 09:48

score 0 · Answer 1 · answered Aug 24 '21 at 15:35

I'm not sure where that image comes from. However, fold and coreutils in general, as well as a huge number of other common CLI utilities, can only be safely used with input consisting of symbols from the POSIX Portable Character Set, not with multibyte UTF-8, regardless of what bullshit websites such as utf8everywhere.org state. fold suffers from the common problem: it assumes that each symbol occupies just a single char, so multibyte UTF-8 input gets corrupted when it splits the lines.

user7860670

For me, `fold` under a UTF-8 locale works just fine. – choroba Aug 24 '21 at 15:39
@choroba For me it does not. At least on Ubuntu 18.04. And even if it worked it is wrong to assume that something that depends on environment will always work without explicitly checking on each invocation whether environment is in appropriate state. – user7860670 Aug 24 '21 at 16:04
@user7860670, the images come from emacs buffer, which i open after the output has been written there. Buffer is displayed in RAW-TEXT instead of UTF-8. But i got the reason why it's doing it, yes. – sad Aug 30 '21 at 12:36

tripleee · Accepted Answer · 2021-08-26T17:23:02.420

I can't reproduce this on macOS, but in an Ubuntu Docker image it happens because fold inserts a newline in the middle of a UTF-8 multibyte sequence.

root@ef177a152b15:/# cat test.org 
Среднеферментированный среднепрожаренный улун полусферической скрутки. Содержание ГАМК 200мг/100г.
root@ef177a152b15:/# fold -w 160 test.org >newfile.org
root@ef177a152b15:/# cat newfile.org 
Среднеферментированный среднепрожаренный улун полусферической скрутки. Содержание Г?
?МК 200мг/100г.
root@ef177a152b15:/# cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
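
To see why the cut lands where it does, here is a minimal Python sketch. It assumes that this build of fold counts bytes rather than characters, which is what the transcript above suggests; the helper name is_continuation is made up for this illustration.

```python
WIDTH = 160  # same width as in the question

# Sample line from the question's test.org.
text = ("Среднеферментированный среднепрожаренный улун "
        "полусферической скрутки. Содержание ГАМК 200мг/100г.")
data = text.encode("utf-8")


def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


# Cut after WIDTH bytes, the way a byte-counting fold would.
head, tail = data[:WIDTH], data[WIDTH:]

# If the first byte after the cut is a continuation byte,
# the cut fell inside a multibyte character.
splits_character = is_continuation(tail[0])
print("cut falls inside a character:", splits_character)

# Decode both halves leniently to see the damage at the seam.
print(head.decode("utf-8", errors="replace")[-3:])
print(tail.decode("utf-8", errors="replace")[:4])
```

Each Cyrillic letter takes two bytes in UTF-8, so a byte count of 160 is reached well before 160 characters, and here it is reached between the two bytes of a single letter.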

(Perhaps also notice that your demo script can be reduced to a one-liner.)

I would have thought that GNU fold is locale-aware, but that you have to configure a UTF-8 locale for the support to be active; but that changes nothing for me.

root@ef177a152b15:/# locale -a
C
C.UTF-8
POSIX
root@ef177a152b15:/# LC_ALL=C.UTF-8 fold -w 160 test.org 
Среднеферментированный среднепрожаренный улун полусферической скрутки. Содержание Г?
?МК 200мг/100г.

Under these circumstances, the best I can offer is to swap fold for a simple replacement script.

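#!/usr/bin/python3

from sys import argv

# First argument: maximum line length, counted in characters, not bytes.
maxlen = int(argv.pop(1))

for file in argv[1:]:
    # Decode explicitly as UTF-8 so the result does not depend on the locale.
    with open(file, encoding='utf-8') as lines:
        for line in lines:
            # The trailing newline does not count toward the width.
            while len(line.rstrip('\n')) > maxlen:
                print(line[0:maxlen])
                line = line[maxlen:]
            print(line, end='')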

For simplicity, this doesn't have any option processing; just pass in the maximum length as the first argument.
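
The question pipes each line into fold rather than passing a file name, so a stream-based variant may be handier. The following is only a sketch under that assumption; fold_stream is a hypothetical helper name, and it counts code points, so it does not account for combining characters or double-width glyphs.

```python
import io


def fold_stream(src, dst, width):
    """Copy text from src to dst, breaking lines longer than width characters."""
    for line in src:
        body = line.rstrip("\n")      # text without the line ending
        ending = line[len(body):]     # "\n", or "" on an unterminated last line
        # Slice by characters; `or [""]` keeps empty lines intact.
        chunks = [body[i:i + width] for i in range(0, len(body), width)] or [""]
        dst.write("\n".join(chunks) + ending)


# Small demonstration with in-memory text streams.
src = io.StringIO("Содержание ГАМК\n")
dst = io.StringIO()
fold_stream(src, dst, 6)
out = dst.getvalue()
print(out, end="")
```

To use it as a filter in a pipeline, one option is to wrap sys.stdin.buffer and sys.stdout.buffer in io.TextIOWrapper with encoding="utf-8" and pass those as src and dst.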

(Python 3 uses UTF-8 throughout on any sane platform. Unfortunately, that excludes Windows; but I am restating the obvious.)

Bash, of course, is entirely innocent here; the shell does not control external utilities like fold. (But not much help, either; echo "${tekst:48:64}" produces similar mojibake.)
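
Whichever tool did the cutting, it can help to confirm whether the output file is still valid UTF-8. The question's symptom, Emacs showing the buffer as raw-text, is consistent with a file that no longer decodes as UTF-8, though nothing above spells out how Emacs makes that decision. A minimal sketch, with a hypothetical helper name:

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


good = "ГАМК".encode("utf-8")
# Simulate a newline inserted between the two bytes of the second letter.
broken = good[:3] + b"\n" + good[3:]

print(first_invalid_utf8_offset(good))
print(first_invalid_utf8_offset(broken))
```

For a real file, read it in binary mode, for example with open('newfile.org', 'rb'), and pass the bytes to the function.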

I tried `locale-gen ru_RU.UTF-8` and activating that locale, but that didn't help at all, either. I was hoping Awk might be UTF-8 clean, but alas, no. — tripleee, Aug 26 '21 at 17:18
