
I'm stuck with a problem. I would like to merge two text files given a specific onset time.

For example:

Text1 (in column):

30 
100 
200

Text2 (in column):

10
50
70

My output should be 1 text file (single column) like this:

10
30
50
70
100
200

I can use cat or merge to combine the files, but I'm not sure how to keep the onset times in ascending order. Thank you in advance for all your help!

Gilles Quénot
Nico
  • `sort -n`? Or is it not always sorting in ascending order (smallest to largest number)? Not the most efficient if already presorted, but your files probably won't be huge (gigabytes and plus) – knittl Feb 13 '23 at 13:02
  • Are both of the inputs single-column files with no other data? – Paul Hodges Feb 13 '23 at 15:17

3 Answers


Like this:

sort -n file1 file2
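On the sample data from the question this gives exactly the desired single column (file names here are illustrative):

```shell
# Recreate the two sample inputs from the question
printf '30\n100\n200\n' > file1
printf '10\n50\n70\n'   > file2

# Numeric sort across both files in one pass
sort -n file1 file2
# -> 10 30 50 70 100 200, one value per line
```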
Gilles Quénot
    Probably avoid the [useless `cat`](https://stackoverflow.com/questions/11710552/useless-use-of-cat) though; `sort` can eminently well read multiple input files. – tripleee Feb 13 '23 at 12:49

Most sort implementations (e.g. GNU coreutils, FreeBSD, OpenBSD, macOS, uutils) have a merge option for creating one sorted file from multiple files that are already sorted.

sort -m -n text1 text2

The only sort without such an option I could find is busybox's. But even that version tolerates an -m option, ignores it, sorts the files as usual, and therefore still gives the expected result.
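On the question's sample data the merge looks like this (file names illustrative; both inputs must already be sorted for -m to be correct):

```shell
# Pre-sorted sample inputs
printf '30\n100\n200\n' > text1
printf '10\n50\n70\n'   > text2

# -m merges the already-sorted files without re-sorting them
sort -m -n text1 text2
# -> 10 30 50 70 100 200, one value per line
```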

I would have assumed that using -m doesn't really matter that much compared to just sorting the concatenated files like busybox does, since sorting algorithms should have optimizations for already sorted parts. However, a small test on my system with GNU coreutils 8.28 proved the contrary:

shuf -i 1-1000000000 -n 10000000 | sort -n > text1  # 10M lines, 95MB  
shuf -i 1-1000000000 -n 10000000 | sort -n > text2
time sort -m -n text1 text2 | md5sum  # real: 2.0s (used only 1 CPU core)
time sort -n text1 text2 | md5sum     # real: 4.5s (used 2 CPU cores)
Socowi
  • So using `-m` is slower than not using it? By a factor of 4 even (double the cores used and twice the time used). I'd expect exactly the opposite – knittl Feb 15 '23 at 10:25
  • @knittl Sorry, I mixed up the order of the comments. Of course `-m` is the faster one. I updated the answer. When in doubt, you can always try it yourself. Just copy-paste the 4 line script into your terminal and run it. – Socowi Feb 15 '23 at 10:38

Although you could just pass both files to sort -n, it seems inelegant not to use the fact that your input files are already sorted. If your inputs are indeed sorted, you could do something like:

awk 'BEGIN { n = (getline a < "text2") }
    { while (n > 0 && a+0 < $1+0) { print a; n = (getline a < "text2") } }
    1
    END { while (n > 0) { print a; n = (getline a < "text2") } }' text1
</awk_placeholder>
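As a self-contained sketch of this streaming-merge idea: the awk below reads text1 line by line while pulling lines from text2 with getline, and an END block flushes whatever remains of text2 once text1 runs out (needed when text2's largest value exceeds text1's; the extra 300 here exercises that case):

```shell
printf '30\n100\n200\n'     > text1
printf '10\n50\n70\n300\n'  > text2

awk 'BEGIN { n = (getline a < "text2") }           # prime the first text2 value
     { while (n > 0 && a+0 < $1+0) {               # emit text2 values smaller than the current text1 value
           print a; n = (getline a < "text2")
       } }
     1                                             # then emit the current text1 value
     END { while (n > 0) {                         # flush any leftover text2 values
           print a; n = (getline a < "text2")
       } }' text1
# -> 10 30 50 70 100 200 300, one value per line
```

This runs in a single pass over both files, so it stays O(n) instead of re-sorting data that is already in order.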
William Pursell
  • 204,365
  • 48
  • 270
  • 300