Here is my command:

awk 'FNR==NR{arr[$1];next}!($3 in arr)' supp.txt data.txt > res.txt
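
As far as I understand it, the command breaks down like this (the same one-liner, just spread out with comments):

awk '
  FNR==NR { arr[$1]; next }  # first file (supp.txt): store each hash as an array key
  !($3 in arr)               # second file (data.txt): print lines whose 3rd field is not a stored key
' supp.txt data.txt > res.txt

FNR==NR is only true while the first file is being read, because FNR resets for each input file while NR keeps counting across them.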

Where supp.txt's content is similar to:

hash1
hash2
hash3

and data.txt:

8723 email hash1
8724 email hash4
8725 email hash5

The values are different and the files can be up to 1 GB in size. res.txt is supposed to contain data.txt minus the lines whose hash exists in supp.txt.

So res.txt should be something like this:

8724 email hash4
8725 email hash5

This works just fine with small files, but files as big as 10 MB fail without any error message: the command simply copies data.txt to res.txt, keeping values from supp.txt even though they're supposed to be removed.

Why is this happening, and what is the workaround?

  • I learned the AWK basics, which helped me make sense of the command, but nothing more than that.
  • I googled the issue, without any luck finding a similar one.
  • I made sure I have enough memory, just in case.

  • Could be `randomstring` contains spaces (since it's apparently random), or `supp.txt` or `data.txt` has DOS line endings or something else. Without sample input/output that reproduces the problem we can't do much to help you debug it. Apply divide and conquer to your real files till you get to the smallest possible files that have the problem, and then post those in your question if you don't already see the problem yourself by that point. – Ed Morton Mar 27 '22 at 20:22
  • Having said that - the command you posted will not work fine with small files as you claim it does. It'll produce no output or blank lines if any exist in data.txt. ITYM `awk 'FNR==NR{arr[$1];next}!($3 in arr)' supp.txt data.txt` (note the parens). – Ed Morton Mar 27 '22 at 20:26
  • Thank you for chiming in, I appreciate it. I am poking at this right now and found out that the array "arr" is filled with empty values for some reason. I have tested the same command with smaller chunks of the same file and the output was just what I wanted. I had already deployed this in an app before people started complaining it doesn't work, and it seems to fail only on big files, which I cannot comprehend yet – Ahmed Serro Mar 27 '22 at 20:30
  • You're welcome. The array `arr[]` **should** be filled with empty values since you're correctly only populating the indices. As I mentioned, the command in your question will NOT behave the way you say it does. Without the actual command you want help with and sample input/output that demonstrates the problem we can't help much but it's **extremely** unlikely that this problem is related to big files and far more likely that those big files just happen to have some specific lines in them or are formatted a specific way, see my previous comment for suggestions. – Ed Morton Mar 27 '22 at 20:44
  • I changed it to `awk 'FNR==NR{arr[NR]=$1;next}!($3 in arr)' supp.txt data.txt`; now I am able to see all the values in the array by adding an `END` statement, but the result is still not what I want, and I don't know if it is matching. The examples here are basically the same as the real input, with hash just being an MD5 hash; no other spaces are involved. I've been stuck here since yesterday and I had no prior awk knowledge beforehand. This line I simply copied off another question on here, and after testing it on a smaller sample it worked, so I'm ashamed to admit that I'm a bit frustrated – Ahmed Serro Mar 27 '22 at 21:08
  • Don't do `arr[NR]=$1`, it's nonsense - just think about what it means, it's nothing to do with your problem. Your original command was right except missing parens, as I specifically stated in [my 2nd comment](https://stackoverflow.com/questions/71639375/removing-lines-from-file-using-awk-fails-in-large-files?noredirect=1#comment126611840_71639375). I also told you in [my 1st comment](https://stackoverflow.com/questions/71639375/removing-lines-from-file-using-awk-fails-in-large-files?noredirect=1#comment126611799_71639375) how to come up with minimal sample input that can reproduce the problem. – Ed Morton Mar 27 '22 at 21:13
  • I would give `grep -vFwf supp.txt data.txt` a try. – Cyrus Mar 27 '22 at 21:16
  • @Cyrus that would produce false matches against `randomstring`. – Ed Morton Mar 27 '22 at 21:17
  • Thank you Cyrus, I tried that but it did not seem to work. Sir Morton, I apologize for my slowness; I now understand what you meant. By the way, the missing parens were a typo, I fixed it, and I changed `randomstring` to `email` because that is what it is. I do believe you are correct when you suggest it could be a DOS line ending, which might explain why some files work while others don't. How would I go about trimming the string or forcing the line endings to be the same, you beautiful genius? – Ahmed Serro Mar 27 '22 at 21:25
  • With DOS line endings I suggest: `grep -vFwf <(dos2unix < supp.txt) data.txt` – Cyrus Mar 27 '22 at 21:27
  • That actually worked! Thank you Cyrus, and thank you Ed for pointing out the DOS thing. I just learned about this now; I thought \n is \n... but I feel like `awk` could be a little faster, and I'm still curious to know how the same thing would be done using `awk`. I assume I could use `dos2unix` for it too, but if there is a more efficient way, nothing learned is wasted – Ahmed Serro Mar 27 '22 at 21:34
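
A quick way to confirm the diagnosis discussed in the comments above (a sketch using standard tools; `od -c` prints a literal \r before the \n on CRLF-terminated lines, and GNU `file` reports CRLF terminators when it finds them):

file supp.txt data.txt       # look for "with CRLF line terminators" in the output
head -n 1 supp.txt | od -c   # a DOS line ends in \r \n instead of just \n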

1 Answer


If the problem is DOS line endings then:

awk '{sub(/\r$/,"")} FNR==NR{arr[$1];next}!($3 in arr)' supp.txt data.txt > res.txt

See "Why does my tool output overwrite itself and how do I fix it?" for alternative ways to handle them.
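
To see why the stray \r defeats the lookup, here is a minimal reproduction (a sketch; `printf` deliberately writes CRLF endings into supp.txt):

printf 'hash1\r\nhash3\r\n' > supp.txt
printf '8723 email hash1\n8724 email hash4\n' > data.txt

# Without stripping \r, the array keys are "hash1\r" and "hash3\r",
# so the plain "hash1" in data.txt never matches and both lines print:
awk 'FNR==NR{arr[$1];next}!($3 in arr)' supp.txt data.txt

# With the \r removed, only the genuinely new line survives:
awk '{sub(/\r$/,"")} FNR==NR{arr[$1];next}!($3 in arr)' supp.txt data.txt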

Ed Morton