I am new to Linux (it is not my own server) and I want to split some Windows text files by calling a bash script from a third-party application.

So far I have it working in two ways up to a point:

split -l 5000 LargeFile.txt SmallFile

for file in SmallFile*
do
    mv "$file" "$file.txt"
done

awk '{filename = "wrd." int((NR-1)/5000) ".txt"; print >> filename}' LargeFile.txt

But both give me txt files with the result:

line1line2line3line4

I found some topics that suggest writing LargeFile.txt like this: $ (LargeFile.txt), but that does not work for me. (I also found a switch that should let the split command produce txt files directly, but that does not work either.)

I hope someone can help me out on this one.

iamcj

1 Answer

Explanation: Line terminators

As explained by various answers to this question, the standard line terminators differ between operating systems:

  • Linux uses LF (line feed, 0x0a)
  • Windows uses CRLF (carriage return and line feed, 0x0d 0x0a)
  • Mac, pre OS X, used CR (carriage return, 0x0d)

To solve your problem, it would be important to figure out what line terminators your LargeFile.txt uses. The simplest way would be the file command:

file LargeFile.txt

The output will indicate whether the line terminators are CR or CRLF (for example "ASCII text, with CR line terminators"); for LF terminators it will just state that it is an ASCII text file.
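If file is not available, or you want a second opinion, the terminators can also be inferred by counting CR and LF bytes. The following is only a minimal sketch: the function name detect_eol is made up for illustration, and treating "equal numbers of CR and LF" as CRLF is a heuristic assumption, not a guarantee.

```shell
# Hypothetical helper (not a standard command): guess the line terminator
# style of a file by counting CR (0x0d) and LF (0x0a) bytes.
detect_eol() {
    # $(( ... )) strips the padding some wc implementations add
    cr=$(( $(tr -cd '\r' < "$1" | wc -c) ))
    lf=$(( $(tr -cd '\n' < "$1" | wc -c) ))
    if [ "$cr" -gt 0 ] && [ "$lf" -eq 0 ]; then
        echo "CR"
    elif [ "$cr" -gt 0 ] && [ "$cr" -eq "$lf" ]; then
        echo "CRLF"      # heuristic: every LF is assumed to follow a CR
    elif [ "$cr" -eq 0 ] && [ "$lf" -gt 0 ]; then
        echo "LF"
    else
        echo "mixed or none"
    fi
}
```

It would be called as detect_eol LargeFile.txt and prints one of CR, CRLF, LF, or "mixed or none".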

LF and CRLF line terminators are both recognized properly in Linux, so with either of them the lines should not appear merged together (no matter how you view the file) unless you configure an editor specifically so that they do. I will therefore assume that your file has CR line terminators.

Example solution to your problem (assuming CR line terminators)

If you want to split the file in the shell and with shell commands, you will potentially face the problem that the likes of cat, split, awk, etc. will not recognize the line endings in the first place and will treat the whole file as a single long line. If your file is very large, this may additionally lead to memory issues, since line-oriented tools may then try to hold that single line in memory.
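The effect is easy to reproduce on a small sample. The sketch below uses a throwaway file created with mktemp (the file and variable names are just for illustration): wc -l counts LF characters, so it sees no lines at all in a CR-terminated file until the terminators are translated.

```shell
# Create a small CR-terminated sample file (three "lines", no LF at all)
demo=$(mktemp)
printf 'line1\rline2\rline3\r' > "$demo"

# wc -l counts LF characters, so the untouched file appears to have 0 lines
lf_lines=$(( $(wc -l < "$demo") ))

# After translating CR to LF, the three lines become visible
cr_lines=$(( $(tr '\r' '\n' < "$demo" | wc -l) ))

echo "lines seen before tr: $lf_lines, after tr: $cr_lines"
rm -f "$demo"
```

This prints "lines seen before tr: 0, after tr: 3", which is also why split -l 5000 would leave such a file in one piece.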

Therefore, the best way to handle this may be to translate the line terminators first (using the tr command) so that they are understood in Linux (i.e. to LF) and then apply your split or awk code before translating the line terminators back (if you believe you need to do this).

tr '\r' '\n' < LargeFile.txt > temporary_file.txt
split -l 5000 temporary_file.txt SmallFile
rm temporary_file.txt
for file in SmallFile*; do tr '\n' '\r' < "$file" > "$file.txt"; rm "$file"; done

Note that the last line is actually a for loop:

for file in SmallFile*
do
    # translate LF back to CR and write the result to a .txt file
    tr '\n' '\r' < "$file" > "$file.txt"
    rm "$file"
done

This loop again uses tr to restore the CR line terminators and additionally gives the resulting files a .txt extension.
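Since you want to call this from a third-party application, it may be convenient to wrap the steps in a function. The following is a sketch under assumptions: the name split_cr_file and its argument order are invented here, and it pipes tr straight into split (which reads standard input when the file name is -) instead of using a temporary file.

```shell
# Hypothetical wrapper: split_cr_file INPUT LINES_PER_PIECE PREFIX
# Splits a CR-terminated file into PREFIXaa.txt, PREFIXab.txt, ...
# and keeps CR as the line terminator in the pieces.
split_cr_file() {
    # Translate CR to LF on the fly; "-" makes split read standard input
    tr '\r' '\n' < "$1" | split -l "$2" - "$3"

    for piece in "$3"*; do
        [ -e "$piece" ] || continue              # glob matched nothing
        case $piece in *.txt) continue ;; esac   # skip output of earlier runs
        # Restore the CR terminators and add the .txt extension
        tr '\n' '\r' < "$piece" > "$piece.txt"
        rm -f "$piece"
    done
}
```

A call such as split_cr_file LargeFile.txt 5000 SmallFile would then correspond to the commands above.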

Some Remarks

Of course, if you would like to keep the LF line terminators, you should not run the tr command in this loop; rename the pieces with mv instead.

And finally, if you find that your file has a different type of line terminator, you may need to adapt the tr command in the first line.
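For instance, if file reports CRLF terminators, split already finds the line breaks on its own (each line still ends in LF), so the pieces keep their CRLF endings and only need renaming. If you would rather have plain LF pieces, the CR can be deleted instead of translated. A sketch, again with an invented function name, and assuming CR occurs only as part of CRLF (tr -d removes every CR, including any stray ones inside a line):

```shell
# Hypothetical wrapper: split_crlf_to_lf INPUT LINES_PER_PIECE PREFIX
# Splits a CRLF-terminated file and writes LF-terminated .txt pieces.
split_crlf_to_lf() {
    # Delete every CR so that only the LF of each CRLF pair remains
    tr -d '\r' < "$1" | split -l "$2" - "$3"

    for piece in "$3"*; do
        [ -e "$piece" ] || continue              # glob matched nothing
        case $piece in *.txt) continue ;; esac   # skip output of earlier runs
        mv "$piece" "$piece.txt"                 # no translation back needed
    done
}
```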

Both tr and split (and also cat and rm) are part of GNU coreutils and should be installed on your system unless you are in a very atypical environment (the rescue shell of an initial RAM disk, perhaps). The file command is not part of coreutils, but it should typically be available as well.

0range