
I used this thread to split a large text file into several smaller files. To split the file I use the following command in Git Bash:

split -l 80000 largeFile

I then want to edit each of the output files, but when I open them in VI, the output looks weird and I cannot properly edit the file. The output contains a lot of @ symbols and carets. I assume that these are control characters. See the following screenshot:

[Screenshot: Vim displays the file as readable characters interleaved with ^@ symbols]

My questions are:

  • Why is the file displayed like this?
  • How can I properly edit the file in VI?
beta

2 Answers


If you look closely, every second character is ^@, which inside Vim represents a null byte (cf. :help <Nul>). The letters in between are readable (USE [TIP_Update_...). So what we're looking at is a 16-bit encoding (i.e. two bytes per character) of (mostly?) ASCII text; as the null byte comes second, it is little endian.

The first two characters (ÿþ) break the rule; they are a byte order mark (BOM), which gives text editors a hint about the encoding. Vim didn't pick it up here and instead rendered the bytes as if the file were in latin1 encoding.

So, you're dealing with 16-bit UCS-2 encoded Unicode (ISO/IEC 10646-1) (name in Vim: ucs-2le; see :help encoding-values), but Vim doesn't detect it automatically.
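You can verify this byte layout outside Vim. A minimal sketch, using a made-up sample file (your real file's contents will differ):

```shell
# Build a small UTF-16LE sample with a leading BOM, then dump its bytes.
printf '\xff\xfe' > sample.txt                           # BOM: ff fe = little endian
printf 'USE' | iconv -f ASCII -t UTF-16LE >> sample.txt  # 'USE' as 16-bit units
xxd sample.txt
# Each ASCII letter is followed by a 00 byte -- the ^@ Vim shows.
```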

You can either

  • manually force the encoding via :help ++enc: :e! ++enc=ucs-2le
  • reconfigure Vim (:help 'fileencodings') to automatically detect these; actually, the default value includes ucs-bom and should detect these just fine.
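Alternatively, if you'd rather not force the encoding each time you open a chunk, you could convert the split files to UTF-8 once, outside Vim. A sketch assuming `split`'s default output names (`xaa`, `xab`, …); the BOM in the first chunk simply becomes a UTF-8 BOM, which Vim handles fine:

```shell
# Convert each split chunk from UTF-16LE to UTF-8 in place.
for f in xa?; do
    iconv -f UTF-16LE -t UTF-8 "$f" > "$f.utf8" && mv "$f.utf8" "$f"
done
```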
Ingo Karkat
  • Thanks, this worked great for the first file of the split output, but for the second one it doesn't work. The result looks like this: https://pasteboard.co/HSrgkz9.png and the file initially looks like this: https://pasteboard.co/HSrgyNm.png The difference is that the first two characters (ÿþ) do not appear here. Can you provide a working command? – beta Dec 19 '18 at 10:44
  • That looks like the split wasn't on an even byte count, so the second one is actually _big-endian_ (with corruption at the borders). If you really need to split on _lines_, your shell also needs to have a correct understanding of the encoding. It would be easier to split on (even!) byte counts, e.g. `split -b 800000`. – Ingo Karkat Dec 19 '18 at 11:23
  • And if you supply the encoding manually, the lack of BOM doesn't matter. For `'fileencodings'`, it would complicate things, though. – Ingo Karkat Dec 19 '18 at 11:24
  • Thanks. However, even when using the command `split -b 100000000 largeFile` the second file looks like this: https://pasteboard.co/HSrQzvI.png But it seems that I can make it work with the command you provided: `:e! ++enc=ucs-2le` – beta Dec 19 '18 at 12:12
  • 1
    Yes, that's what I meant; you need the manual `++enc` as the BOM is missing from subsequent splits. – Ingo Karkat Dec 19 '18 at 12:27
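The even-byte-count point from the comments can be illustrated with a small made-up sample: splitting a UTF-16LE file at an even offset keeps every two-byte character unit intact, so each chunk is still valid UCS-2 LE (just without a BOM after the first chunk):

```shell
printf 'ABCDEFGH' | iconv -f ASCII -t UTF-16LE > wide.txt  # 16 bytes
split -b 8 wide.txt part_           # even byte count: chunks stay aligned
iconv -f UTF-16LE -t UTF-8 part_ab  # second chunk still decodes cleanly
```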

I know this is a very old post, but it might help someone. If you are using Git Bash to split the file into multiple files, try creating the main file in UTF-8 (code page 65001) instead of ANSI 1252 or any other encoding. I was facing the same NUL issues in my split files, but when I converted the main file to UTF-8, it worked perfectly fine.

split -l 50000 Main.txt Split.txt
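If you can't control the encoding the original file was saved in, the same effect can be had with `iconv` before splitting; a sketch with assumed file names:

```shell
# Convert the main file to UTF-8 first, then split on lines as usual.
iconv -f UTF-16LE -t UTF-8 Main.txt > Main.utf8.txt
split -l 50000 Main.utf8.txt Split.
```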
Suraj Rao