Explanation: Line terminators
As explained by various answers to this question, the standard line terminators differ between operating systems:
- Linux uses LF (line feed, 0x0a)
- Windows uses CRLF (carriage return and line feed, 0x0d 0x0a)
- Mac, pre OS X, used CR (carriage return, 0x0d)
To solve your problem, it is important to figure out which line terminators your LargeFile.txt uses. The simplest way is the file command:
file LargeFile.txt
The output will indicate if the line terminators are CR or CRLF, and otherwise just state that it is an ASCII file.
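If the file command is not at hand, the terminator bytes can also be counted directly. The following is only a rough sketch: detect_eol is a hypothetical helper name of mine, and the CRLF case is a heuristic (equal counts of CR and LF bytes), not a proof that every CR is followed by an LF.

```shell
# Hypothetical helper (not a standard tool): guess the line terminators
# of a file by counting CR (0x0d) and LF (0x0a) bytes.
detect_eol() (
    # $(( ... )) strips the padding some wc implementations print
    cr=$(( $(tr -cd '\r' < "$1" | wc -c) ))
    lf=$(( $(tr -cd '\n' < "$1" | wc -c) ))
    if [ "$cr" -eq 0 ] && [ "$lf" -eq 0 ]; then
        echo "none"
    elif [ "$lf" -eq 0 ]; then
        echo "CR"
    elif [ "$cr" -eq 0 ]; then
        echo "LF"
    elif [ "$cr" -eq "$lf" ]; then
        echo "CRLF"    # heuristic: as many CR bytes as LF bytes
    else
        echo "mixed"
    fi
)
```

Calling detect_eol LargeFile.txt should then print CR for a file like the one I assume below.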
LF and CRLF line terminators are both recognized properly in Linux, so lines should not appear merged together (no matter how you view the file) unless you configure an editor specifically to do so. I will therefore assume that your file has CR line terminators.
Example solution to your problem (assuming CR line terminators)
If you want to split the file in the shell with shell commands, you will potentially face the problem that the likes of cat, split, awk, etc. will not recognize the line endings in the first place. If your file is very large, this may additionally lead to memory issues, since line-oriented tools such as awk would then treat the entire file as one single line.
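To illustrate the point, here is a small sketch with made-up sample data (the temporary file and its contents are my own, not taken from the question). wc -l counts LF bytes, so it sees no lines at all in a CR-terminated file until the terminators are translated:

```shell
# Build a tiny sample with three CR-terminated "lines" (hypothetical data).
sample=$(mktemp)
printf 'one\rtwo\rthree\r' > "$sample"

# wc -l counts LF bytes; $(( ... )) strips any padding wc may print.
lines_as_is=$(( $(wc -l < "$sample") ))
lines_after_tr=$(( $(tr '\r' '\n' < "$sample" | wc -l) ))

echo "lines seen as-is: $lines_as_is, after tr: $lines_after_tr"
rm -f "$sample"
```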
Therefore, the best way to handle this may be to translate the line terminators first (using the tr command) so that they are understood in Linux (i.e. to LF), and then apply your split or awk code before translating the line terminators back (if you believe you need to do this).
tr '\r' '\n' < LargeFile.txt > temporary_file.txt
split -l 5000 temporary_file.txt SmallFile
rm temporary_file.txt
for file in SmallFile*; do filex=$file.txt; tr '\n' '\r' < "$file" > "$filex"; rm "$file"; done
Note that the last line is actually a for loop:
for file in SmallFile*
do
    filex=$file.txt
    # translate LF back to CR and write the result to the .txt file
    tr '\n' '\r' < "$file" > "$filex"
    # remove the intermediate LF-terminated chunk
    rm "$file"
done
This loop will again use tr to restore the CR line terminators and additionally give the resulting files a .txt filename extension.
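The same steps can also be wrapped in a function that avoids the temporary file by piping tr straight into split, which reads standard input when its file operand is "-". This is only a sketch: split_cr_file and its three arguments are names I made up, and it assumes that no unrelated files starting with the chosen prefix exist in the current directory.

```shell
# Hypothetical wrapper: split a CR-terminated file into chunks of N lines,
# restoring the CR terminators in each chunk.
# Usage: split_cr_file INPUT LINES PREFIX
split_cr_file() (
    input=$1
    lines=$2
    prefix=$3

    # Translate CR to LF on the fly; "-" makes split read standard input.
    tr '\r' '\n' < "$input" | split -l "$lines" - "$prefix"

    for chunk in "$prefix"*; do
        [ -f "$chunk" ] || continue
        # Skip .txt files left over from an earlier run.
        case $chunk in *.txt) continue ;; esac
        # Translate LF back to CR, then drop the intermediate chunk.
        tr '\n' '\r' < "$chunk" > "$chunk.txt" && rm -- "$chunk"
    done
)
```

With the names used in this answer, the call would be split_cr_file LargeFile.txt 5000 SmallFile.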
Some Remarks
Of course, if you would like to keep the LF line terminators, you should not execute the last line (the for loop).
And finally, if you find that you have a different type of line terminators, you may need to adapt the tr command in the first line.
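As a sketch of what such an adaptation could look like: for CRLF files, replacing CR with LF would produce an empty line after every real line, so there the CR should be deleted instead. The function name to_lf and its argument values are my own invention; it reads standard input and writes standard output. Note that tr -d '\r' removes every CR byte, including stray ones that are not part of a CRLF pair.

```shell
# Hypothetical filter: normalize standard input to LF line terminators.
# Usage: to_lf CR|CRLF|LF < input > output
to_lf() {
    case $1 in
        CR)   tr '\r' '\n' ;;   # old Mac style: replace each CR with LF
        CRLF) tr -d '\r' ;;     # Windows style: drop the CR, keep the LF
        LF)   cat ;;            # already LF: pass through unchanged
        *)    echo "to_lf: unknown terminator type: $1" >&2
              return 1 ;;
    esac
}
```

For a Windows-style file, the first line of the solution would then become to_lf CRLF < LargeFile.txt > temporary_file.txt.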
Both tr and split (and also cat and rm) are part of GNU coreutils and should be installed on your system unless you are in a very atypical environment (a rescue shell of an initial RAM disk, perhaps). The file command is not part of coreutils, but it should typically be available as well.