0

I'm trying to select the list of ID's where specific position is not empty (for ex; position 29,30,103 and 104). If the position is empty it should be rejected, tried with awk its working well with less data (<100) but all the ID's are getting selected with big data (>1000000). Please provide suggestion.

awk '
    {FS=",";$0=$0;
     if ($29!="" && $30!="" && $323!="" && $324!= "") print "ID", NR, "selected" }
' file.csv 

this cmd works only with small data, please advice where i'm missing out.

Gilles Quénot
  • 173,512
  • 41
  • 224
  • 223
  • 1
    Could you please provide a data sample? – Alexandre Juma Nov 24 '22 at 11:19
  • 2
    Your CSV might have DOS line-endings making $324 non-empty if it's the last field, or contain some quoted fields with commas and/or newlines in them. See [whats-the-most-robust-way-to-efficiently-parse-csv-using-awk](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk). In lieu of a better approach, use divide-and-conquer on a failing CSV to identify a line that gets printed that shouldn't, then apply divide-and-conquer on the fields in that line to identify which one causes the problem. – Ed Morton Nov 24 '22 at 11:24
  • 1
    Also, I guess setting FS and recompiling fields with $0=$0 could be avoided by spinning up awk with -F"," directly. – Alexandre Juma Nov 24 '22 at 11:27
  • 1
    For more info on the DOS line endings issue, see [why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it](https://stackoverflow.com/questions/45772525/why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it). – Ed Morton Nov 24 '22 at 11:32
  • I just tested a similar procedure on a file of 1,000,000 records, each with ~150 fields (not an exact replication) and saw no similar problem. I can confirm, as commented by @AlexandreJuma above, that setting the field separator for every record slows it down significantly. Use either `-F","` as a shell switch or use `BEGIN{FS=","}` inside awk. `$0=$0` is not needed. – Dave Pritlove Nov 24 '22 at 22:49
  • Thank you @AlexandreJuma, Ed Morton and Dave Pritlove for suggestions, its working fine with large data but please suggest how to make the condition stringent, I have to reject all the ID's even if one of the position is empty, but this cmd selecting the ID's only when all the four positions are empty. – user20578273 Nov 25 '22 at 03:32
  • @user20578273 the logical behavior you just described in your requirement is a logical OR (||) and not a logical AND (&&). I will answer this question taking this into consideration. – Alexandre Juma Nov 25 '22 at 22:51

1 Answers1

1

I've generated a file with 1024456 records. Each record has 152 fields separated by commas and each record is generated with a probability of getting empty fields across the records fields.

Sample of the data:

,,29378,,,,10154,,,,,,,,6118,,29,15384,,,,27106,30693,,,,2021,,,,,30609,,15148,,,3406,10181,,,,178,,,,,,,,31308,10049,,,14783,,,,,26032,,,,21999,,,15978,,,,,,12975,22933,,,18981,,,,,,21590,21196,,,,,,14680,,18167,9839,,,5282,,,27112,,,1264,,,22086,,,,,,,,,,,,,,18940,,,11353,,,29966,32569,2495,,11841,,25529,,15423,,,,2799,,15511,,,3010,,,4359,,,,,,12244,18968,13926

As expected, explicitly avoiding retokenizing every record will yield better results:

for run in {1..10}; do \
    /usr/bin/time --format='%C took %e seconds' \
    awk -F"," \
    '{if ($29!="" || $30!="" || $92!="" || $132!= "") print "ID", NR, "selected" }' \
    file1.txt > /dev/null; 
done

awk (...) took 3.36 seconds
awk (...) took 3.35 seconds
awk (...) took 3.78 seconds
awk (...) took 3.48 seconds
awk (...) took 3.58 seconds
awk (...) took 3.75 seconds
awk (...) took 3.49 seconds
awk (...) took 3.53 seconds
awk (...) took 3.47 seconds
awk (...) took 3.93 seconds

Than the original solution the OP provided:

for run in {1..10}; do \
    /usr/bin/time --format='%C took %e seconds' \
    awk \
    '{FS=",";$0=$0; if ($29!="" || $30!="" || $92!="" || $132!= "") print "ID", NR, "selected" }' \
    file1.txt > /dev/null; 
done

awk (...) took 9.04 seconds
awk (...) took 8.93 seconds
awk (...) took 9.05 seconds
awk (...) took 9.14 seconds
awk (...) took 8.93 seconds
awk (...) took 9.05 seconds
awk (...) took 8.76 seconds
awk (...) took 9.72 seconds
awk (...) took 9.29 seconds
awk (...) took 9.17 seconds

PS: I've run each solution 10 times to average out the results and I also made a slight change to the OP code as per the requirement (only one of the fields must be empty for the record to be selected).

Alexandre Juma
  • 3,128
  • 1
  • 20
  • 46