
I used this thread to split a large text file into several smaller files. To split the file I use the following command in Git Bash:

split -l 80000 largeFile

I then want to edit each of the output files, but when I open them in VI, the output looks weird and I cannot properly edit the file. The output contains a lot of @ symbols and carets. I assume that these are control characters. See the following screenshot:

[Screenshot: Vim displays the file as readable characters interleaved with ^@ symbols]

My questions are:

  • Why is the file displayed like this?
  • How can I properly edit the file in VI?
beta

2 Answers


If you look closely, every second character is ^@, which inside Vim represents a null byte (cf. :help <Nul>). The letters in between are readable (USE [TIP_Update_...). So what we're looking at is a 16-bit encoding (i.e. two bytes per character) of (mostly?) ASCII text; as the null byte comes second, it is little endian.

The first two characters (ÿþ) break the rule; they are a byte order mark (BOM), which gives text editors a hint about the encoding. Vim didn't pick it up here and instead rendered the bytes as if the file were in latin1 encoding.

So, you're dealing with 16-bit UCS-2 encoded Unicode (ISO/IEC 10646-1) (name in Vim: ucs-2le; see :help encoding-values), but Vim doesn't detect it automatically.
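You can verify this byte layout outside Vim. A minimal sketch, using a made-up sample file (your real file's contents will differ):

```shell
# Build a small UTF-16LE sample with a leading BOM, then dump its bytes.
printf '\xff\xfe' > sample.txt                           # BOM: ff fe = little endian
printf 'USE' | iconv -f ASCII -t UTF-16LE >> sample.txt  # 'USE' as 16-bit units
xxd sample.txt
# Each ASCII letter is followed by a 00 byte -- the ^@ Vim shows.
```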

You can either

  • manually force the encoding via :help ++enc: :e! ++enc=ucs-2le
  • reconfigure Vim (:help 'fileencodings') to automatically detect these; actually, the default value includes ucs-bom and should detect these just fine.
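Alternatively, if you'd rather not force the encoding each time you open a chunk, you could convert the split files to UTF-8 once, outside Vim. A sketch assuming `split`'s default output names (`xaa`, `xab`, …); the BOM in the first chunk simply becomes a UTF-8 BOM, which Vim handles fine:

```shell
# Convert each split chunk from UTF-16LE to UTF-8 in place.
for f in xa?; do
    iconv -f UTF-16LE -t UTF-8 "$f" > "$f.utf8" && mv "$f.utf8" "$f"
done
```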
Ingo Karkat
  • Thanks, this worked great for the first file of the split output, but for the second one it doesn't work. The result looks like this: https://pasteboard.co/HSrgkz9.png and the file initially looks like this: https://pasteboard.co/HSrgyNm.png The difference is that the first two characters (ÿþ) do not appear here. Can you provide a working command? – beta Dec 19 '18 at 10:44
  • That looks like the split wasn't on an even byte count, so the second one is actually _big-endian_ (with corruption at the borders). If you really need to split on _lines_, your shell also needs to have a correct understanding of the encoding. It would be easier to split on (even!) byte counts, e.g. `split -b 800000`. – Ingo Karkat Dec 19 '18 at 11:23
  • And if you supply the encoding manually, the lack of BOM doesn't matter. For `'fileencodings'`, it would complicate things, though. – Ingo Karkat Dec 19 '18 at 11:24
  • Thanks. However, even when using the command `split -b 100000000 largeFile` the second file looks like this: https://pasteboard.co/HSrQzvI.png But it seems that I can make it work with the command you provided: `:e! ++enc=ucs-2le` – beta Dec 19 '18 at 12:12
  • 1
    Yes, that's what I meant; you need the manual `++enc` as the BOM is missing from subsequent splits. – Ingo Karkat Dec 19 '18 at 12:27
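The even-byte-count point from the comments can be illustrated with a small made-up sample: splitting a UTF-16LE file at an even offset keeps every two-byte character unit intact, so each chunk is still valid UCS-2 LE (just without a BOM after the first chunk):

```shell
printf 'ABCDEFGH' | iconv -f ASCII -t UTF-16LE > wide.txt  # 16 bytes
split -b 8 wide.txt part_           # even byte count: chunks stay aligned
iconv -f UTF-16LE -t UTF-8 part_ab  # second chunk still decodes cleanly
```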

I know this is a very old post, but it might help someone. If you are using Git Bash to split the file into multiple files, try creating the main file in UTF-8 (code page 65001) instead of ANSI 1252 or any other encoding. I was facing the same NUL issues in my split files, but when I converted the main file to UTF-8, it worked perfectly fine.

split -l 50000 Main.txt Split.txt
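If you can't control the encoding the original file was saved in, the same effect can be had with `iconv` before splitting; a sketch with assumed file names:

```shell
# Convert the main file to UTF-8 first, then split on lines as usual.
iconv -f UTF-16LE -t UTF-8 Main.txt > Main.utf8.txt
split -l 50000 Main.utf8.txt Split.
```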
Suraj Rao