3

Regarding integer-adding one-liners, several shell scripting solutions have been proposed;
however, a closer look at each of them reveals inherent limitations:

  • awk ones choke on arbitrary precision and integer size (awk behaves C-like, after all; see the demonstration below)
  • bc ones are unhappy with arbitrarily long inputs: (sed 's/$/+\\/g';echo 0)|bc
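
For instance, awk's precision ceiling is easy to demonstrate: 2^53 + 1 has no exact IEEE 754 double representation, so a stock awk silently rounds it (a minimal illustration; gawk built with MPFR support and run with -M is the exception discussed in the comments below):

$ echo 9007199254740993 | awk '{printf "%.0f\n", $1 + 0}'
9007199254740992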

Understanding that, on top of that, there may be portability issues across platforms (see [1], [2]), which is undesired, is there a generic solution that wins on both practicality and brevity?

Hint: SunOS & MacOSX are examples where portability would be an issue.
For instance, could the dc command handle arbitrarily large (2^n, integer or otherwise) inputs?

[1] awk: https://stackoverflow.com/a/450821/1574494 or https://stackoverflow.com/a/25245025/1574494 or Printing long integers in awk

[2] bc: Bash command to sum a column of numbers

fgeorgatos
  • At what point does `bc` complain about long input? – rici Jun 27 '17 at 01:36
  • If you use later versions of `gawk` arbitrary precision is supported. – dawg Jun 27 '17 at 03:12
  • Please clarify: (1) What is a **unix** one-liner? Unix is not a programming language. (2) What do you mean by *arbitrary precision*? The term means usually *exact fractional arithmetic*, but it is not clear from your posting, whether you mean it too. (3) How is the input provided? And: Is it two numbers or a list of numbers? – user1934428 Jun 27 '17 at 05:42
  • OP here: question rewritten for clarity, providing references of where `awk` and `bc` might fail. I hope it now clarifies the shortcomings of either. – fgeorgatos Jun 27 '17 at 08:30
  • How then about Ruby? For instance `ruby -e "p 777777710000000000000000000000000000000001+3"` prints *777777710000000000000000000000000000000004* to stdout. – user1934428 Jun 27 '17 at 15:33
  • 1
    @dawg [beware of gawk arbitrary precision](https://stackoverflow.com/a/52485753/995714). I've checked and gawk produces incorrect output for most values compared to bc, perl or python, thus I've just filed a bug report – phuclv Sep 26 '18 at 05:57

6 Answers

6

An optimal solution for dc(1) sums the inputs as they are read:

$ jot 1000000 | sed '2,$s/$/+/;$s/$/p/' | dc
500000500000
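
Here the sed script turns the number stream into a single dc program: every line after the first gets a trailing + (add the new value to the running total on the stack), and the last line additionally gets p (print the result). jot is a BSD utility; on GNU systems the same input can be generated with seq, with (presumably) identical output:

$ seq 1000000 | sed '2,$s/$/+/;$s/$/p/' | dc
500000500000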
kdhp
  • 1
    OP here, @kdhp: * wow, this is really good, this is what I was after! * Very good performance too: `> $ time jot 1000000 | sed '2,$s/$/+/;$s/$/p/' | dc 500000500000` `> real 0m2.987s` `> user 0m3.971s` `> sys 0m0.035s` – fgeorgatos May 08 '19 at 13:02
  • @fgeorgatos : 3.971 seconds is fast?? `awk`s do it in 0.169 s: `time jot 1000000 | mawk2 '{__+=$_}END{print __}'` prints `500000500000`; `jot 1000000` 0.16s user 0.00s system 98% cpu 0.170 total, `mawk2 '{__+=$_}END{print __}'` 0.07s user 0.00s system 45% cpu 0.169 total – RARE Kpop Manifesto Dec 28 '22 at 21:14
3

The one I usually use is paste -sd+|bc:

$ time seq 1 20000000 | paste -sd+|bc
200000010000000

real    0m10.092s
user    0m10.854s
sys     0m0.481s

(For strict Posix compliance, paste needs to be provided with an explicit argument: paste -sd+ -|bc. Apparently that is necessary with the BSD paste implementation installed by default on OS X.)
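
With a tiny input it is easy to see the expression that paste hands to bc (using the POSIX form with the explicit `-` operand):

$ seq 3 | paste -sd+ -
1+2+3
$ seq 3 | paste -sd+ - | bc
6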

However, that will fail for larger inputs, because bc buffers an entire expression in memory before evaluating it. On my system, bc ran out of memory trying to add 100 million numbers, although it was able to do 70 million. But other systems may have smaller capacities.

Since bc has variables, you could avoid long lines by repetitively adding to a variable instead of constructing a single long expression. This is (as far as I know) 100% Posix compliant, but there is a 3x time penalty:

$ time seq 1 20000000|sed -e's/^/s+=/;$a\' -es|bc
200000010000000

real    0m29.224s
user    0m44.119s
sys     0m0.820s
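
To make the sed step concrete, here is what bc receives for a tiny input: each line N becomes the statement s+=N, and the appended final line s makes bc print the accumulated variable:

$ seq 3 | sed -e's/^/s+=/;$a\' -es
s+=1
s+=2
s+=3
s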

Another way to handle the case where the input size exceeds bc's buffering capacity would be to use the standard xargs tool to add the numbers in groups:

$ time seq 1 100000000 |
> IFS=+ xargs sh -c 'echo "$*"' _ | bc | paste -sd+ | bc
5000000050000000

real    1m0.289s
user    1m31.297s
sys     0m19.233s

The number of input lines used by each xargs evaluation will vary from system to system, but it will normally be in the hundreds and it might be much more. Obviously, the xargs | bc invocations could be chained arbitrarily to increase capacity.
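
The mechanics are easier to see with artificially small batches; -n 3 (a standard xargs option) caps each group at three arguments here purely for illustration. Like the pipeline above, this relies on sh inheriting IFS from its environment, so that "$*" joins the arguments with +:

$ seq 7 | IFS=+ xargs -n 3 sh -c 'echo "$*"' _
1+2+3
4+5+6
7
$ seq 7 | IFS=+ xargs -n 3 sh -c 'echo "$*"' _ | bc
6
15
7

The final paste -sd+ | bc stage then sums those partial sums.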

It might be necessary to limit the size of the xargs expansion using the -s switch on systems where ARG_MAX exceeds the capacity of the bc command. Aside from performing an experiment to establish the bc buffer limit, there is no portable way to find out what that limit might be, but it certainly should be no less than LINE_MAX, which is guaranteed to be at least 2048. Even with 100-digit addends, that will allow a reduction by a factor of 20, so a chain of 10 xargs|bc pipes would handle over 10^13 addends, assuming you were prepared to wait a couple of months for that to complete.
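
There is no getconf name for bc's buffer size, but the two limits mentioned above can be queried (the values shown are typical for a Linux system and will differ elsewhere):

$ getconf ARG_MAX    # upper bound on the size of one xargs invocation
2097152
$ getconf LINE_MAX   # POSIX guarantees at least 2048
2048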

As an alternative to constructing a large fixed-length pipeline, you could use a function to recursively pipe the output from xargs|bc until only one value is produced:

radd () { 
    # Recursive reduction: sum the input in xargs-sized groups with bc,
    # then pipe the partial sums back through radd until one value remains.
    if read a && read b; then
        # At least two values remain: put them back in front of the rest.
        { printf '%s\n%s\n' "$a" "$b"; cat; } |
          IFS=+ xargs -s $MAXLINE sh -c 'echo "$*"' _ |
          bc | radd
    else
        # Only one value left: that is the total.
        echo "$a"
    fi
}

If you use a very conservative value for MAXLINE, the above is quite slow, but with plausible larger values it is not much slower than the simple paste|bc solution:

$ time seq 1 20000000 | MAXLINE=2048 radd
200000010000000

real    1m38.850s
user    0m46.465s
sys     1m34.503s

$ time seq 1 20000000 | MAXLINE=60000 radd 
200000010000000

real    0m12.097s
user    0m17.452s
sys     0m5.090s

$ time seq 1 100000000 | MAXLINE=60000 radd 
5000000050000000

real    1m3.972s
user    1m31.394s
sys     0m27.946s

As well as the bc solutions, I timed some other possibilities. As shown above, with an input of 20 million numbers, paste|bc took 10 seconds. That's almost identical to the time used by adding 20 million numbers with

gawk -M '{s+=$0} END{print s}'

Programming languages such as python and perl proved to be faster:

# 9.2 seconds to sum 20,000,000 integers
python -c $'import sys\nprint(sum(int(x) for x in sys.stdin))'
# 5.1 seconds
perl -Mbignum -lne '$s+=$_; END{print $s}'

I was unable to test dc -f - -e '[+z1<r]srz1<rp' on large inputs, since its performance appears to be quadratic (or worse); it summed 25 thousand numbers in 3 seconds, but it took 19 seconds to sum 50 thousand and 90 seconds to do 100 thousand.

Although bc is not the fastest and memory limitations require awkward workarounds, it has the advantage of working out of the box on Posix-compliant systems without the necessity to install enhanced versions of any standard utility (awk) or programming languages not required by Posix (perl and python).

rici
  • That `paste -sd+|` can be replaced with `seq`'s *separator* switch `-s`, *i.e.*: `time seq -s+ 1 20000000 | bc`. – agc Jun 27 '17 at 03:13
  • 2
    @agc: I don't believe that the input is likely to be `seq` and if it were, you could just compute the sum directly. I just used `seq` because it shows that `bc` can handle twenty million values. – rici Jun 27 '17 at 03:19
  • OP here: I like the elegance of `paste|bc`, however `bc` needs to deal with a finite buffer (at least on certain platforms), so it's not the generic solution we are after. Please, do prove me wrong if you know otherwise. – fgeorgatos Jun 27 '17 at 08:36
  • @rici: I have also come to think that `xargs|bc` incantations chained in a tree fashion (you can do more levels than 2, think like i-nodes) could/would permit processing huge lists of numbers; however, the drawback is that the limits we put in there are arbitrary and bc's input buffer size changes from system to system, so it is very unclear how to break down the inputs in a generic way. – fgeorgatos Jun 28 '17 at 01:14
  • @fgeorgatos: I mentioned the chaining possibility in my answer; I have now made it explicit with a version which doesn't need to know how long a chain to construct (although I'm not sure that it still falls in the category of "elegant one-liners"). I also added a cumulating `bc` solution using a variable which is not competitive in execution time (approximately 3x slow-down) but should be completely portable as long as you don't exceed `bc`'s line limit with a single integer. – rici Jun 28 '17 at 15:59
  • @rici: I find your radd() solution with MAXLINE=60000 the best bet for high portability, with the least of constraints (MAXLINE may give compatibility jitter across platforms, though). Happy to accept that as the best answer so far, although it is entertaining that it is not as simple as people might have thought at first sight. – fgeorgatos Jul 07 '17 at 16:40
  • @fgeorgatos: The cumulating `bc` solution is probably the most portable but the slow-down is dramatic. `radd()` is, if nothing else, an interesting case study of recursive pipelining, which has certain other applications. – rici Jul 07 '17 at 20:54
1

You can use gawk with the -M flag:

$ seq 1 20000000 | gawk -M '{s+=$0} END{print s}'
200000010000000

Or Perl with bignum enabled:

$ seq 1 20000000 | perl -Mbignum -lne '$s+=$_; END{print $s}'
200000010000000
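
A quick check that -Mbignum really lifts the precision ceiling (a minimal sketch; without the pragma, perl would print the double approximation 3.40282366920938e+38):

$ perl -Mbignum -le 'print 2**128'
340282366920938463463374607431768211456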
dawg
  • OP here: interesting offer, I didn't know about the `-M` flag of `awk`; however, this won't work on several `MacOSX` versions and I doubt that `SunOS` would be happy about it either - unless you refresh the `GNU tools` installation to a modern one, which defeats the original question's description! `Perl`: although I'm Perl-averse, this might be a not too bad pick, given that Perl has been available across several platforms for 20+ years. What about compatibility of `-Mbignum` though? – fgeorgatos Jun 27 '17 at 08:39
  • `defeats the original question's description`??? You specifically said `Understanding that there will be issues of portability across platforms [1] [2], is there some generic solution which is a winner on both practicality and brevity?`. So you effectively said "don't worry about portability as I understand that's an issue" and then complained now that the posted solution isn't portable (unless you install the specific tool). You specifically asked for a practical and brief solution and got one with `gawk -M`. – Ed Morton Jun 27 '17 at 13:11
  • Well I ran those two examples on MacOS. `gawk` is trivial to install with Brew. Apple itself will not be updating to GNU tools because of incompatibility with GPL 3.0, but they are trivial to install. `bignum` dates back to 2002 and has been part of the Perl distro since maybe 2010. The perl example should run on any platform that has perl. btw: `| paste -sd+|bc` does not run on MacOS. – dawg Jun 27 '17 at 15:04
  • @dawg: What happens when you use `paste -sd+|bc` on Mac OS? `paste` and `bc` should both be available... – rici Jun 27 '17 at 15:15
  • 1
    @rici: The command on BSD / OS X would either be `seq 1 10 | paste -sd+ - | bc` (note the `-` for the file) or `paste -sd+ <(seq 1 10) | bc` In either case, with larger sequences, `bc` gives `(standard_in) 1: parse error` Works fine for smaller sequences. – dawg Jun 27 '17 at 15:33
  • @dawg: Huh. I've certainly used that command on a Mac but it would have been five years ago and it is quite possible that I had gnu coreutils installed on it. Can you quantify "larger"? – rici Jun 27 '17 at 16:39
  • @dawg: thanks for clarifying the perl situation; so even that is not that much of common ground, after all - a sigh of relief from this end ;-) @Ed Morton: I've made it clearer in the description that portability issues are undesired. After all, we like to write scripts which are write-once run-everywhere, don't we? To all: most people have correctly assumed that we are after a posix-compliant solution; please keep making that assumption, because it is the right thing. – fgeorgatos Jun 28 '17 at 01:07
  • @rici: bigger than `seq 1 50` but smaller than `seq 1 20000000` I also tried the `xargs` approach but that did not solve it. – dawg Jun 28 '17 at 01:32
  • @rici: Turns out the issue is `seq` on OS X. Try `seq 999999 1000000`, which produces `999999\n1e+06`. Now try `seq 999999 1000000 | paste -sd+ - | bc`, which produces `(standard_in) 1: parse error` because of the `1e+06`. So `seq` and `bc` have their own platform issues, `bc` because it does not support 'e' scientific notation. – dawg Jun 28 '17 at 19:02
  • @dawg: good to know, thanks. I tried compiling bsd bc from source and still couldn't repro a problem. Although you clearly hit a memory limit eventually. – rici Jun 28 '17 at 19:49
1

$ seq 1000 | (sum=0; while read num; do sum=`echo $sum+$num | bc -l`; done; echo $sum)
500500

Also, this one will not win a top-speed prize; however, it IS:

  • a one-liner, yes.
  • portable
  • adds lists of any length
  • adds numbers of any precision (each number's length limited only by MAXLINE)
  • does not rely on external tools such as python/perl/awk/R etc

With a stretch, you may call it elegant too ;-) Come on guys, show the better way to do this!

fgeorgatos
  • If we just need to add large integers, Bash is enough. For example, to add the size of all images in a directory: `ls -l *.png | cut -d ' ' -f 5 | (sum=0; while read num; do sum=$[$sum+$num]; done; echo $sum)` – Roberto Apr 05 '19 at 12:00
0

It seems that the following does the trick:

$ seq 1000|dc -f - -e '[+z1<r]srz1<rp'
500500
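
For reference, here is one way to read that dc program; GNU dc accepts # comments, so it can be spelled out long-hand (assumed equivalent to the one-liner):

$ seq 1000 | dc -f - -e '
[+z1<r]   # macro: add the top two stack values, then recurse while 1 < depth
sr        # store the macro in register r
z1<r      # z pushes the stack depth; run r if 1 < depth
p'
500500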

But is it the optimal solution?

fgeorgatos
  • This cannot be optimal because it is too slow to be practical on input sizes far smaller than the memory limit of bc. See my answer. – rici Jun 27 '17 at 15:26
  • @rici: interesting discovery about the quadratic performance. Not entirely surprised, but I thought it would be possible to do smarter things with a stack based language - perhaps there is another `dc` incantation... tbd. – fgeorgatos Jun 28 '17 at 00:57
0

jot really slows you down:

( time ( jot 100000000 | pvZ -i 0.2 -l -cN in0 |
    mawk2 '{ __+=$_ } END { print __ }' FS='\n' ) )

      in0:  100M 0:00:17 [5.64M/s] [ <=> ]

( jot 100000000 | nice pv -pteba -i 1 -i 0.2 -l -cN in0 |
    mawk2 FS='\n'; )

26.43s user 0.78s system 153% cpu 17.730 total
5000000050000000

using another awk instance to generate the sequence shaves off 39.7%:

( time (
    mawk2 -v __='100000000' '
    BEGIN { for(_-=_=__=+__;_<__;) {
                print ++_ } }' |
    pvZ -i 0.2 -l -cN in0 |
    mawk2 '{ __+=$_ } END{ print __ }' FS='\n' ))

      in0:  100M 0:00:10 [9.37M/s] [ <=> ]

( mawk2 -v __='100000000' 'BEGIN {…}' | )

19.44s user 0.68s system 188% cpu 10.687 total
5000000050000000

for the bc option, gnu-paste is quite a bit faster than bsd-paste in this regard, but both absolutely pale compared to awk, while perl is only slightly behind:

time jot 15000000 | pvE9 | mawk2 '{ _+=$__ } END { print _ }'
     out9:  118MiB 0:00:02 [45.0MiB/s] [45.0MiB/s] [ <=> ]

112500007500000

jot 15000000   2.60s user 0.03s system 99% cpu 2.640 total
pvE 0.1 out9   0.01s user 0.05s system  2% cpu 2.640 total
mawk2 '{...}'  1.09s user 0.03s system 42% cpu 2.639 total

perl -Mbignum -lne '$s+=$_; END{print $s}'   # perl 5.36
               1.36s user 0.03s system 52% cpu 2.662 total

time jot 15000000 | pvE9 | gpaste -sd+ - | bc
     out9:  118MiB 0:00:02 [45.3MiB/s] [45.3MiB/s] [ <=> ]

112500007500000

jot 15000000   2.59s user 0.03s system 99% cpu 2.627 total
pvE 0.1 out9   0.01s user 0.05s system  2% cpu 2.626 total
gpaste -sd+ -  0.27s user 0.03s system 11% cpu 2.625 total   # gnu-paste
bc             4.55s user 0.46s system 66% cpu 7.544 total

time jot 15000000 | pvE9 | paste -sd+ - | bc
     out9:  118MiB 0:00:05 [22.7MiB/s] [22.7MiB/s] [ <=>  ]

112500007500000

jot 15000000  2.63s user 0.03s system 51% cpu  5.207 total
pvE 0.1 out9  0.01s user 0.06s system  1% cpu  5.209 total
paste -sd+ -  5.14s user 0.05s system 99% cpu  5.211 total   # bsd-paste
bc            4.53s user 0.40s system 49% cpu 10.029 total
RARE Kpop Manifesto