
How can I append multiple files into a single file using multithreading? Each of my files has about 10M rows, so I want to process all the files at the same time.

#!/bin/bash
appendFiles A.TXT &
appendFiles B.TXT &
appendFiles C.TXT &
wait

function appendFiles
{
    while read -r line; do
        echo $line >>final.txt
    done < $1
}
jcrshankar

3 Answers


Have you tried using a simple cat, like this:

cat A.txt B.txt C.txt > final.txt

It's way faster than reading each file line by line, even when it's done in parallel.

You could also try a parallel cat, but in my tests it wasn't faster than doing it in one command. (Tested with three files of around 10M rows each.)

#!/bin/bash
# Define the function before it is called.
function appendFiles
{
    cat "$1" >> final.txt
}

appendFiles A.TXT &
appendFiles B.TXT &
appendFiles C.TXT &
wait
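If you want to check the claim on your own files, a rough comparison like the one below should do; the file names are placeholders, and the numbers depend heavily on your disk and the page cache, so run each variant a couple of times.

# Sequential: one cat reads the three files back to back
time cat A.TXT B.TXT C.TXT > final_serial.txt

# Parallel: three cats append to the same file concurrently
# (the order of the chunks in the result is not guaranteed)
: > final_parallel.txt
time { cat A.TXT >> final_parallel.txt &
       cat B.TXT >> final_parallel.txt &
       cat C.TXT >> final_parallel.txt &
       wait; }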
Florian Schlag

I would leave comments, but there are just too many things wrong with this. Pardon me if this sounds harsh; this is a common enough misconception that I want to be terse and to the point rather than polite.

As a basic terminology fix, there is no threading here. There are two distinct models of concurrency, and Bash only supports one of them, namely multiprocessing. Threading happens inside a single process, but there is no way in Bash to manage the internals of other processes (and that would be quite problematic anyway). Bash can start and stop processes (not threads), and it does that very well.
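For illustration only (the file names are placeholders and sort is just a stand-in for any long-running command), this is what Bash's process-level concurrency looks like:

#!/bin/bash
# Each & forks a separate child *process*; Bash never creates threads.
sort A.TXT > A.sorted &  pid_a=$!
sort B.TXT > B.sorted &  pid_b=$!

# The parent shell can only start, signal, and wait for these processes;
# it has no way to reach into their internal state.
wait "$pid_a" "$pid_b"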

But adding CPU concurrency in an effort to speed up tasks which are not CPU bound is a completely flawed idea. The reason I/O takes time is that your disk is slow. Your CPU sits idle for the vast majority of the time while your spinning disk (or even SSD) fills and empties DMA buffers at speeds which are glacial from the CPU's perspective.

In fact, adding more processes to compete for limited I/O capacity is likely to make things slower, not faster, because the I/O channel is forced to juggle many things at once when maintaining locality would be better: don't make the disk head jump between unrelated files when you will only have to jump back a few milliseconds later. The same holds for an SSD, though with much less dramatic effects; streaming a contiguous region of storage is still more efficient than scattered random access.

Adding to this, your buggy reimplementation of cat is going to be horribly slow. Bash is notorious for being very inefficient in while read loops. (The main bug is the quoting, but there are corner cases with read that you want to avoid, too.)
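To see what the quoting bug does, here is a made-up line run through both forms (the unquoted expansion is subject to word splitting and globbing):

line='  some   spaced   text with a * in it  '
echo $line              # unquoted: runs of spaces collapse and * is expanded against files in the current directory
echo "$line"            # quoted: printed verbatim
printf '%s\n' "$line"   # safer still: unlike echo, printf never mistakes the data for an option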

Moreover, you are opening the file, seeking to the end of the file for appending, and closing it again each time through the loop. You can avoid this by moving the redirection outside the loop:

while IFS= read -r line || [[ -n $line ]]; do
    printf '%s\n' "$line"
done <"$1" >>final.txt

But this still suffers from the inherent excruciating slowness of while read. If you really want to combine these files, I would simply cat them all serially.

cat A.TXT B.TXT C.TXT >final.txt

If I/O performance is really a concern, combining many text files into a single text file is probably a step in the wrong direction, though. For information you need to read more than once, reading it into a database is a common way to speed it up. Initializing and indexing the database adds some overhead up front, but this is quickly paid back when you can iterate over the fields and records much more quickly and conveniently than when you have them in a sequential file.
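As a rough sketch of that idea, assuming sqlite3 is available and the rows happen to be two comma-separated fields with no header line (the database, table, and column names here are made up for illustration):

sqlite3 combined.db <<'EOF'
CREATE TABLE IF NOT EXISTS records(field1 TEXT, field2 TEXT);
.mode csv
.import A.TXT records
.import B.TXT records
.import C.TXT records
CREATE INDEX IF NOT EXISTS records_field1_idx ON records(field1);
EOF

After that, repeated lookups and scans go through the index instead of re-reading tens of millions of lines of text each time.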

tripleee

Normally, disks perform best if they do sequential reads. That is why this is typically the best solution if you have a single disk:

cat file1 file2 file3 > file.all

But if your storage is a distributed network file system, or a RAID system, then things may perform radically differently. In that case you may get a performance boost by reading the files in parallel.

The most obvious solution, however, is bad:

(cat file1 & cat file2 & cat file3 &) > file.all

This is because you risk getting the first half of a line from file1 mixed with the last half of a line from file2.

If you instead use parcat (part of GNU Parallel), then you will not see this mixing because it is designed to guard against that:

parcat file1 file2 file3 > file.all

or (slower, but essentially the same):

parallel --line-buffer -j0 cat ::: file1 file2 file3 > file.all
Ole Tange