
I'm stuck with a problem. I would like to merge two text files given a specific onset time.

For example:

Text1 (in column):

30 
100 
200

Text2 (in column):

10
50
70

My output should be 1 text file (single column) like this:

10
30
50
70
100
200

I can use cat or merge to combine the files, but I'm not sure how to keep the onset times in ascending order. Thank you in advance for all your help!

Gilles Quénot
Nico
  • `sort -n`? Or is it not always sorting in ascending order (smallest to largest number)? Not the most efficient if already presorted, but your files probably won't be huge (gigabytes and plus) – knittl Feb 13 '23 at 13:02
  • Are both of the inputs single-column files with no other data? – Paul Hodges Feb 13 '23 at 15:17

3 Answers


Like this:

sort -n file1 file2
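On the sample data from the question this gives exactly the desired single column (file names here are illustrative):

```shell
# Recreate the two sample inputs from the question
printf '30\n100\n200\n' > file1
printf '10\n50\n70\n'   > file2

# Numeric sort across both files in one pass
sort -n file1 file2
# -> 10 30 50 70 100 200, one value per line
```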
Gilles Quénot
    Probably avoid the [useless `cat`](https://stackoverflow.com/questions/11710552/useless-use-of-cat) though; `sort` can eminently well read multiple input files. – tripleee Feb 13 '23 at 12:49

Most sort implementations (e.g. GNU coreutils, FreeBSD, OpenBSD, macOS, uutils) have a merge option for creating one sorted file from multiple files that are already sorted.

sort -m -n text1 text2

The only sort without such an option I could find is busybox's. But even that version tolerates an -m option, ignores it, sorts the files as usual, and therefore still gives the expected result.
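On the question's sample data the merge looks like this (file names illustrative; both inputs must already be sorted for -m to be correct):

```shell
# Pre-sorted sample inputs
printf '30\n100\n200\n' > text1
printf '10\n50\n70\n'   > text2

# -m merges the already-sorted files without re-sorting them
sort -m -n text1 text2
# -> 10 30 50 70 100 200, one value per line
```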

I would have assumed that using -m doesn't really matter that much compared to just sorting the concatenated files like busybox does, since sorting algorithms should have optimizations for already sorted parts. However, a small test on my system with GNU coreutils 8.28 proved the contrary:

shuf -i 1-1000000000 -n 10000000 | sort -n > text1  # 10M lines, 95MB  
shuf -i 1-1000000000 -n 10000000 | sort -n > text2
time sort -m -n text1 text2 | md5sum  # real: 2.0s (used only 1 CPU core)
time sort -n text1 text2 | md5sum     # real: 4.5s (used 2 CPU cores)
Socowi
  • So using `-m` is slower than not using it? By a factor of 4 even (double the cores used and twice the time used). I'd expect exactly the opposite – knittl Feb 15 '23 at 10:25
  • @knittl Sorry, I mixed up the order of the comments. Of course `-m` is the faster one. I updated the answer. When in doubt, you can always try it yourself. Just copy-paste the 4 line script into your terminal and run it. – Socowi Feb 15 '23 at 10:38

Although you could just pass both files to sort -n, it seems inelegant not to use the fact that your input files are already sorted. If your inputs are indeed sorted, you could do something like:

awk 'BEGIN { n = (getline a < "text2") }
    { while (n > 0 && a+0 < $1+0) { print a; n = (getline a < "text2") } }
    1
    END { while (n > 0) { print a; n = (getline a < "text2") } }' text1
</awk_placeholder>
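As a self-contained sketch of this streaming-merge idea: the awk below reads text1 line by line while pulling lines from text2 with getline, and an END block flushes whatever remains of text2 once text1 runs out (needed when text2's largest value exceeds text1's; the extra 300 here exercises that case):

```shell
printf '30\n100\n200\n'     > text1
printf '10\n50\n70\n300\n'  > text2

awk 'BEGIN { n = (getline a < "text2") }           # prime the first text2 value
     { while (n > 0 && a+0 < $1+0) {               # emit text2 values smaller than the current text1 value
           print a; n = (getline a < "text2")
       } }
     1                                             # then emit the current text1 value
     END { while (n > 0) {                         # flush any leftover text2 values
           print a; n = (getline a < "text2")
       } }' text1
# -> 10 30 50 70 100 200 300, one value per line
```

This runs in a single pass over both files, so it stays O(n) instead of re-sorting data that is already in order.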
William Pursell
  • 204,365
  • 48
  • 270
  • 300