
I just ran these two commands on a file with around 250 million records.

awk '{if(substr($0,472,1)=="9") print $0}' < file1.txt >> file2.txt

and

nawk '{if(substr($0,472,1)=="9") print $0}' < file1.txt >> file2.txt

The record length is 482. The first command gave the correct number of records in file2.txt, i.e. 60 million, but the nawk command gives only 4.2 million.

I am confused and would like to know if someone has come across an issue like this. How exactly is this simple command being treated differently internally? Is there a buffer that can hold only up to a certain number of bytes when using nawk?

I would appreciate it if someone could throw some light on this.

My OS details are:

SunOS <hostname> 5.10 Generic_147148-26 i86pc i386 i86pc
Ankit
  • If the command didn't fail somehow, probably the buffer in nawk was set to a limit. – konsolebox Sep 13 '13 at 15:00
  • Can you rephrase your question to eliminate the `>>` append into `file2.txt`? Maybe have the nawk version `> file3.txt`? I assume that you realize this is happening, but given your code examples, what you report can't possibly be true. Did you try `nawk '...' file1.txt > file2.txt`, eliminating the redirect into the script? Shouldn't make any difference, but worth a try. Also, I would examine the raw data at the point of the 4.2mill+1 record and be sure there isn't some weird character in the file, again, it shouldn't matter, but ??. Good luck. – shellter Sep 13 '13 at 15:03
  • @shellter, I tried with nawk '{if(substr($0,472,1)=="9") print $0}' < file1.txt > file3.txt, the result is the same. Also 4.2m+1 seems to be correct and it is captured by awk but not nawk. – Ankit Sep 13 '13 at 15:07
  • @konsolebox is there a way I can check the buffer limit ? – Ankit Sep 13 '13 at 15:08
  • @Ankit You probably can see that in the source code of nawk. – konsolebox Sep 13 '13 at 15:09
  • @konsolebox : what buffer limit? Except for line-size, nawk (should be) processing one line at a time, right? I used to process files with ~10 mill lines with nawk, back in the day, and would have expected it to work for any number of lines. @Ankit: please show us result of `which awk`, `which nawk`. Good luck to all! – shellter Sep 13 '13 at 15:18
  • @Ankit in your question you say `The first command gave the correct number of records in file2.txt i.e.; 60 million but the nawk command gives only 4.2 million.` but then in your comment above you say the opposite `Also 4.2m+1 seems to be correct and it captured by awk but not nawk`. Please state clearly which output you think is correct and which tool is producing that output. – Ed Morton Sep 13 '13 at 17:14

2 Answers


The difference probably lies in the record-size limit of nawk. One of the records (lines) in your input file has probably exceeded it.

This crucial line can be found in awk.h:

#define RECSIZE (8 * 1024)  /* sets limit on records, fields, etc., etc. */
konsolebox
  • Learn something every day! Still hoping to see which `awk` @Ankit is accessing: if it's `/usr/bin/awk`, I'd really be surprised, while if it is `/usr/xpg4/bin/awk`, then that is just interesting. Also, I wouldn't assume that something that lives at `netbsd.org` is the same `nawk` that is found on "SunOS 5.10", but I could be wrong about that too ;-) Good luck to all. – shellter Sep 13 '13 at 16:07
  • @Ankit you wrote 'The record length is 482'. It doesn't seem right that a record exceeding (8 * 1024) bytes would be a valid record here. Good luck. – shellter Sep 13 '13 at 16:09
  • I agree with @shellter. If the length of each line is 482 characters then no single record is exceeding the buffer size. Also, if each record is the same size then they'd ALL be exceeding the buffer size, not just some of them. There's something else going on here. – Ed Morton Sep 13 '13 at 17:03
  • I admit I actually didn't mind (didn't notice) much about the record length (the one specified). I kept thinking more about the number of lines (60 million/4.2 million) and that on such a large number it's likely that some of those lines could be longer than common. I agree that there could be another possibility though seeing the note. – konsolebox Sep 13 '13 at 18:08
  • @konsolebox So the solution is editing this .h file? – lolololol ol Apr 26 '17 at 00:47
  • @lololololol Well hacking the awk utility is one "solution", but I can't answer your question, unless I decide to study the source code again. However if you do decide to hack it, I'd suggest that you confirm first that the problem still exists in the newest version of nawk, and that the "RECSIZE" limit is actually the one causing the problem. – konsolebox Apr 26 '17 at 20:03

Your command can be reduced to just this:

awk 'substr($0,472,1)==9'

On Solaris (which you are on), running awk by default runs the old, broken awk (/usr/bin/awk), so I suspect that nawk is the one producing the correct result.

Run /usr/xpg4/bin/awk with the same script/arguments and see which of your other results its output agrees with.
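For example (a sketch with a small generated input standing in for file1.txt; on Solaris, run the same filter through /usr/bin/awk, /usr/bin/nawk and /usr/xpg4/bin/awk in turn and compare the counts):

```shell
# Stand-in input: two 482-character records; only the second has "9"
# at column 472.
awk 'BEGIN {
    r = sprintf("%482s", ""); gsub(/ /, "0", r); print r
    print substr(r, 1, 471) "9" substr(r, 473)
}' > sample.txt

# The reduced filter from above; each awk variant should report the
# same count if it handles the records correctly.
awk 'substr($0,472,1)==9' sample.txt | wc -l
# Prints 1: only the second record matches.
```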

Also, check if your input file was created on Windows by running dos2unix on it and see if its size changes; if so, re-run your awk commands on the modified file. If it was created on Windows then it will have some control-Ms in there that could be causing chaos.
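A non-destructive way to check for Windows line endings first, a sketch that only counts carriage returns without modifying anything (again with a tiny stand-in for file1.txt):

```shell
# Stand-in for file1.txt: two lines, the second with a Windows-style CRLF.
printf 'unix line\nwindows line\r\n' > sample.txt

# Count lines ending in a carriage return; a non-zero count means the
# file has Windows line endings and dos2unix would shrink it.
awk '/\r$/ { n++ } END { print n + 0 }' sample.txt
# Prints 1 here: only the second line carries a CR.
```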

Ed Morton