1

I have a large CSV file(~1GB) - data.csv
A variable with a huge list of Pipe separated strings.
list="abc|def|ghi.........."

Objective is to search 2nd and 3rd columns of data.csv for each and every string listed in list variable and replace it with the string unassigned

Below is what i came up with,
awk -v list="$list" 'BEGIN{FS=OFS=","}{gsub(list,"unassigned",$2)}{gsub(list,"unassigned",$3)}1' data.csv > data_new.csv

It works fine as long as the list is small. Once the list variable crosses around 10k strings, it'll throw the error
/usr/bin/awk: Argument list too long

Is there any solution to handle this long list here? Totally new solution is also welcome. Thanks in advance.

Note : Would prefer to avoid looping through list since it degrades the performance.

codeforester
  • 39,467
  • 16
  • 112
  • 140
  • 1
    I'm not seeing how that can work even if the list is only 2 values `list='foo,bar'` since the gsub will be looking for a regexp `foo,bar` instead of what I **think** you'd want which is `foo|bar`. [edit] your question to include concise, testable sample input and expected output so we can help you. – Ed Morton May 31 '18 at 19:23
  • Thanks for the update. there is a `tr "," "|"` which i missed here. Corrected the variable data. – Optimus Prime May 31 '18 at 19:44
  • is your line separated by `|` or `,` ? – Haifeng Zhang May 31 '18 at 19:46
  • data.csv is separated by `,` which is why i added `BEGIN{FS=OFS=","}` to define it as the seperator – Optimus Prime May 31 '18 at 19:49

1 Answers1

1

Do this instead if you're using bash:

awk '
BEGIN { FS=OFS="," }
NR==FNR { list=$0; next}
{ gsub(list,"unassigned",$2); gsub(list,"unassigned",$3) }
1' <<<"$list" data.csv > data_new.csv

See Why do I get "/bin/sh: Argument list too long" when passing quoted arguments? and Does "argument list too long" restriction apply to shell builtins? for what's happening with your original code and how this solves the problem.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • 1
    "list" is missing on the assignment. Also `NR==1` perhaps better here... – karakfa May 31 '18 at 19:59
  • Tried the snippet but no data comes out. – Optimus Prime May 31 '18 at 20:23
  • @karakfa thanks, corrected wrt `list`. wrt `NR==1` vs `NR==FNR` - I could go either way but at least NR==FNR lets you know the input is coming from the first input stream so I slightly prefer that. – Ed Morton May 31 '18 at 22:10
  • @OptimusPrime try it now. – Ed Morton May 31 '18 at 22:11
  • same result as before. data_new.csv is empty. – Optimus Prime Jun 01 '18 at 08:37
  • The "1" at the end should print the line out. but due to some strange reason its not printing anything out. – Optimus Prime Jun 01 '18 at 08:43
  • Assuming you'd have told us if there was any error message, the **only** way for that script to produce no output is if you don't provide it input. Add `{print FILENAME, $0}'` above the `NR==FNR` line to see what input it's getting. It is obviously untested though because you didn't provide anything for us to test against in your question. – Ed Morton Jun 01 '18 at 13:30
  • There was no error, just nothing comes out. And i couldnt provide a test data set since i dont have the issue with small data set. Now when i added the print as suggested, data_new got filled with data.csv – Optimus Prime Jun 01 '18 at 23:01
  • Providing a small data set would let us test that the script does produce the expected output for that at least though. Don't you have the same problem you're having now (no output) with a small data set? You said data_new got filled with data.csv. You did NOT say that the first line of data.csv was the value of "$line" which is what would be expected. The implication is that either "$line" is empty or you aren't using exactly the script I posted to get it passed to the awk script or there's a bug in the awk script (untested since you didn't provide sample input) that isn't reading "$line" – Ed Morton Jun 02 '18 at 12:11