11

I have a pipe-delimited feed file with several fields. Since I only need a few, I thought of using awk to capture them for testing. However, I noticed that printf changes the value if I use "%d". It works fine if I use "%s".

Feed File Sample:

[jaypal:~/Temp] cat temp

302610004125074|19769904399993903|30|15|2012-01-13 17:20:02.346000|2012-01-13 17:20:03.307000|E072AE4B|587244|316|13|GSM|1|SUCC|0|1|255|2|2|0|213|2|0|6|0|0|0|0|0|10|16473840051|30|302610|235|250|0|7|0|0|0|0|0|10|54320058002|906|722310|2|0||0|BELL MOBILITY CELLULAR, INC|BELL MOBILITY CELLULAR, INC|Bell Mobility|AMX ARGENTINA SA.|Claro aka CTI Movil|CAN|ARG|

I am interested in capturing the second column which is 19769904399993903.

Here are my tests:

[jaypal:~/Temp] awk -F"|" '{printf ("%d\n",$2)}' temp
19769904399993904   # Value is changed

However, the following two tests work fine:

[jaypal:~/Temp] awk -F"|" '{printf ("%s\n",$2)}' temp
19769904399993903   # Value remains same

[jaypal:~/Temp] awk -F"|" '{print $2}' temp
19769904399993903   # Value remains same

So is this a limitation of "%d" not being able to handle long integers? If that's the case, why would it add one to the number instead of, say, truncating it?

I have tried this with BSD and GNU versions of awk.

Version Info:

[jaypal:~/Temp] gawk --version
GNU Awk 4.0.0
Copyright (C) 1989, 1991-2011 Free Software Foundation.

[jaypal:~/Temp] awk --version
awk version 20070501
jaypal singh
  • What happens if you use awk's `printf "%17.0f\n"`? My experience with awk says to post this question on comp.lang.awk. Good luck! – shellter Jan 13 '12 at 23:04
  • Thanks @shellter. I got the same result. Surprisingly it only happens inside of `awk`. If I do `printf %d` and my value on the command line , it prints correctly. If I do the same inside `awk's BEGIN` statement it messes it up. :) – jaypal singh Jan 13 '12 at 23:11
  • It printed correct number with the version of awk that is part of the UWIN system. I think it boils down to the 'quality' of the version of C-lib functions linked in with your version of awk. Also do you have access to a 64bit machine and a 64bit awk/gawk? Good luck. – shellter Jan 14 '12 at 00:00
  • Hmm unfortunately no .. so basically we can call it a bug as I can do `printf %d` on the CLI and it works. It's weird that even the GNU 4.0.0 version of `awk` reproduces this. – jaypal singh Jan 14 '12 at 01:50
  • The internal printf command in new(ish) kshs as found in some Linux distributions (Ubuntu among others) and UWIN will also be correct (just like bash). How much data are you talking about processing? And is off-by-one on a 17-digit number really significant to your problem? (Oh, that doesn't look like financial data, so it is likely significant ;-) ). Good luck. – shellter Jan 15 '12 at 04:06

7 Answers

13

Starting with GNU awk 4.1 you can use --bignum or -M

$ awk 'BEGIN {print 19769904399993903}'
19769904399993904

$ awk --bignum 'BEGIN {print 19769904399993903}'
19769904399993903

See the "Command-Line Options" section of the GNU awk manual.

5

I believe the underlying numeric format in this case is an IEEE double. So the changed value is a result of floating point precision errors. If it is actually necessary to treat the large values as numerics and to maintain accurate precision, it might be better to use something like Perl, Ruby, or Python which have the capabilities (maybe via extensions) to handle arbitrary-precision arithmetic.
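
To see why the value moves up by one instead of being truncated, here is a minimal sketch. It assumes an awk that stores numbers as IEEE 754 doubles and is not running in an arbitrary-precision mode. Between 2^54 and 2^55, doubles are spaced 4 apart, so 19769904399993903 has no exact representation and is rounded to the nearest multiple of 4. `%.0f` is used rather than `%d` only because the handling of very large values by `%d` can differ between awk builds.

```shell
# Assumes IEEE 754 doubles and no arbitrary-precision mode (for example, no gawk -M).
# Convert the string to a number and print it with no fractional digits.
nearest=$(awk 'BEGIN { x = "19769904399993903" + 0; printf "%.0f\n", x }')
echo "$nearest"    # 19769904399993904

# Neighbouring doubles are 4 apart at this magnitude, so adding 1 rounds back to x.
same=$(awk 'BEGIN { x = "19769904399993903" + 0; y = x + 1; print (y == x) }')
echo "$same"       # 1
```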

Mark Wilkins
  • Thanks Mark, so how can we handle such numbers with `printf`? It's not a show stopper for me but just wanted to know for learning purposes – jaypal singh Jan 13 '12 at 22:19
  • I don't think it is possible to represent a number this large accurately in AWK. My understanding (which may be incorrect) is that awk always uses double precision to store numeric values. As long as you don't need to perform math operations, the best bet is to print/use them as strings (which you already found out). – Mark Wilkins Jan 13 '12 at 22:25
  • Correct. According to `info gawk`: "The internal representation of all numbers, including integers, uses double-precision floating-point numbers. On most modern systems, these are in IEEE 754 standard format." – Dennis Williamson Jan 18 '12 at 18:10
4

UPDATE: Recent versions of GNU awk support arbitrary precision arithmetic. See the GNU awk manual for more info.

ORIGINAL POST CONTENT: XMLgawk supports arbitrary precision arithmetic on floating-point numbers. So, if installing xgawk is an option:

zsh-4.3.11[drado]% awk --version |head -1; xgawk --version | head -1
GNU Awk 4.0.0
Extensible GNU Awk 3.1.6 (build 20080101) with dynamic loading, and with statically-linked extensions

zsh-4.3.11[drado]% awk 'BEGIN {
  x=665857
  y=470832
  print x^4 - 4 * y^4 - 4 * y^2
  }'
11885568

zsh-4.3.11[drado]% xgawk -lmpfr 'BEGIN {
  MPFR_PRECISION = 80
  x=665857
  y=470832
  print mpfr_sub(mpfr_sub(mpfr_pow(x, 4), mpfr_mul(4, mpfr_pow(y, 4))), 4 * y^2)
  }'
1.0000000000000000000000000
Dimitre Radoulov
  • https://sourceforge.net/projects/gawkextlib/files/xgawk/ says that GNU awk 4.1 obsoletes `xgawk` as a separate binary. It recommends `gawk` with `gawkextlib`. And your xgawk link is dead. I wasn't sure which link would be best, so I didn't edit your post myself. – Peter Cordes Aug 01 '15 at 20:26
3

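This question was partially answered by @Mark Wilkins and @Dennis Williamson already, but I found out that the largest integer that can be handled without losing precision is 2^53: awk stores numbers as 64-bit doubles, which have a 53-bit significand, so every integer up to 2^53 is exact and gaps appear above it. See awk's reference page: http://www.gnu.org/software/gawk/manual/gawk.html#Integer-Programming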
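
A quick way to check the 2^53 boundary, as a sketch that assumes IEEE 754 doubles and no arbitrary-precision mode:

```shell
# Assumes IEEE 754 doubles; 2^53 is 9007199254740992.
out=$(awk 'BEGIN {
    limit = 2^53
    printf "%.0f\n", limit        # 9007199254740992
    printf "%.0f\n", limit - 1    # 9007199254740991, still exact
    printf "%.0f\n", limit + 1    # 9007199254740992, the +1 is lost
}')
echo "$out"
```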

(sorry if my answer is too old. Figured I'd still share for the next person before they spend too much time on this like I did)

3150
1

You're running into awk's floating-point representation issues. I don't think you can find a workaround within the awk framework to perform arithmetic on huge numbers accurately.

The only possible (and crude) way I can think of is to break the huge number into smaller chunks, perform your math on each chunk, and join them again. Better yet, use a scripting language such as Perl/PHP/TCL/bsh that is more powerful than awk.
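
The chunking idea can be sketched as follows. `add_small` is a hypothetical helper written for this illustration, not something awk provides. It assumes the input is a non-negative decimal string longer than 9 digits, that the number being added is a small non-negative integer, and that each chunk stays well below 2^53.

```shell
# add_small is a hypothetical helper, not part of awk.
# Assumes: s is a non-negative decimal string longer than 9 digits,
# n is a small non-negative integer, and each chunk stays below 2^53.
result=$(awk '
function add_small(s, n,    hi, lo, carry) {
    hi = substr(s, 1, length(s) - 9)      # leading digits
    lo = substr(s, length(s) - 8) + n     # last 9 digits as a number, plus n
    carry = int(lo / 1000000000)          # overflow out of the low chunk
    lo = lo % 1000000000
    return sprintf("%.0f%09.0f", hi + carry, lo)
}
BEGIN {
    print add_small("19769904399993903", 2)    # 19769904399993905
    print add_small("19769904999999999", 1)    # 19769905000000000
}')
echo "$result"
```

The first result, 19769904399993905, is a value that a double cannot hold at this magnitude, which is why the pieces are joined as text rather than as a number.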

anubhava
  • Thanks Anubhava. That sounds right, coz when I do this at the command line, it prints it fine `[jaypal:~/Temp] printf "%d" 19769904399993903 19769904399993903` – jaypal singh Jan 13 '12 at 22:32
0

Using nawk on Solaris 11, I convert the number to a string by concatenating an empty string to the end, and then use %15s as the format string:

printf("%15s\n", bignum "")   
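
The same concatenation trick also matters for comparisons. In this sketch, which assumes an awk using doubles without arbitrary-precision mode, two different 17-digit IDs compare equal as numbers but not as strings:

```shell
# Two different 17-digit IDs that round to the same double.
cmp=$(echo "19769904399993903 19769904399993904" | awk '{
    numeric = ($1 == $2)               # fields look numeric, so they compare as numbers
    textual = (($1 "") == ($2 ""))     # appending "" forces a string comparison
    print numeric, textual
}')
echo "$cmp"    # 1 0
```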
Stibu
  • avoid `nawk` in general if possible - because both evaluates to true in `nawk` ::::::::: : `nawk 'BEGIN { print ("\1" ~ "\0"), ("\304\200" ~ /\400/) }' ==> 1 1` :::::: 1st one stemming from fact `nawk` incorrectly counts terminating `null byte \0` of a `c-string` as part of it for `regex` purposes (in contradiction w/ values from its own `length()`), ….. – RARE Kpop Manifesto Jul 19 '23 at 11:24
  • ….. while 2nd one is `nawk`'s incorrectly discarding modular wrap-around effects for 9-bit octal codes `\400-\777`, so instead of a regex checking for null byte, it checks for `256-th code-point of UTF-8`, which would be `\304\200 | C4 80` – RARE Kpop Manifesto Jul 19 '23 at 11:24
0

Another caveat about precision: the errors pile up with extra operations.

echo 19769904399993903 | mawk2 '{ CONVFMT = "%.2000g";
                                     OFMT =   "%.20g"; 
        } {
           print;
           print +$0; 
           print $0/1.0
           print $0^1.0; 

           print exp(-log($0))^-1; 
           print exp(1*log($0))
           print sqrt(exp(exp(log(20)-log(10))*log($0))) 
           print (exp(exp(log(6)-log(3))*log($0)))^2^-1   
        }'
19769904399993903
19769904399993904
19769904399993904
19769904399993904
19769904399993912
19769904399993908
19769904399993628 <<<—— -275
19769904399993768 <<<—- -135

The first few results are off by less than 10. The last two equations have triple-digit deltas.

For any of the versions that require calling helper math functions, simply passing the -M (bignum) flag is insufficient. One must also set the PREC variable.

For this example, setting PREC=64 and OFMT="%.17g" should suffice.

Beware of setting OFMT too high relative to PREC; otherwise you'll see oddities like this:

gawk -M -v PREC=256 -e '{ CONVFMT="%.2000g"; OFMT="%.80g";... } '

19769904399993903
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
19769904399993903.000000000000000000000000000000000000000000000000000000000003734

This happens because 80 significant digits require a precision of at least 265.75 bits (80 × log2(10)), so effectively 266 bits. gawk is fast enough that you can probably just preset PREC=4096 or PREC=8192 instead of having to worry about it every time.
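
The bit counts can be checked with a short calculation. This is a sketch using the rule of thumb that d decimal digits need about d × log2(10) bits. `bits_for` is a hypothetical helper name introduced here.

```shell
# bits_for is a hypothetical helper: smallest whole number of bits for d decimal digits.
bits=$(awk '
function bits_for(d,    b) {
    b = d * log(10) / log(2)
    return (b == int(b)) ? b : int(b) + 1
}
BEGIN { printf "%.2f %d %d\n", 80 * log(10) / log(2), bits_for(80), bits_for(17) }')
echo "$bits"    # 265.75 266 57
```

By this estimate the 17-digit value in the question needs 57 bits, which is why PREC=64 is enough for it while the 53 bits of a double are not.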

RARE Kpop Manifesto