
Is there a better way to escape the pipe (`|`) in a pipe-separated file?

I have a pipe-separated file. I tried running the awk command below, and it works fine only for records that don't contain escaped double quotes, because it treats the double quote as the field separator.

Input file:

"first | last | name" |" john | white "| age | 52
school |" ABC | USA "| year | 2016
Home | Road | year\" | 1989\" 
company |" Pvt | ltd "| joining | 2019

Code:

awk '
BEGIN { FS=OFS="\"" }              
  { for (i=2;i<=NF;i+=2)      
        gsub(/\|/,"\\|",$i)    
    print
  }
' testfile.txt

Output I am getting:

"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" \| 1989\" 
company |" Pvt \| ltd "| joining | 2019

Expecting output :

"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" | 1989\" 
company |" Pvt \| ltd "| joining | 2019

In the 3rd row it escapes the pipe after `year`, but that is wrong because that double quote is part of the 3rd column. Can I work on a particular column so that the pipe is escaped only when it belongs to the same column?

Kalpesh
  • Could you please explain more on how you want to perform substitution part? Thank you. – RavinderSingh13 Aug 05 '22 at 15:15
  • `that double quote is part of 3rd column`: Can you tell what text should be in 3rd column? Isn't 3rd column just `" ABC \| USA "` ? – anubhava Aug 05 '22 at 15:18
  • Your input is CSV-like (just with `|` instead of `,` as the separator) but then it doesn't conform to any of the CSV "standards" in that it has `"`s in fields that aren't quoted, e.g. `year\"` instead of `"year"""`. Are you sure that's the format you're using as it would be a very odd choice to choose not to conform to any standard and so make things more difficult than necessary (as you apparently have discovered). Can't you just change to use a standard format like https://www.rfc-editor.org/rfc/rfc4180 (but keeping `|` instead of `,`)? If you did that then your awk script would work as-is. – Ed Morton Aug 05 '22 at 15:45
  • @anubhava if you check the 3rd row, the 3rd column should be `year\"` and the 4th column is `1989\"`, whereas in the 2nd row the 3rd column is `year` and the second column is `" ABC \| USA "` – Kalpesh Aug 05 '22 at 15:48
  • @EdMorton this is my input .. I don't have control on that file creation.. they are enclosing only | in double quotes, rest I need to handle – Kalpesh Aug 05 '22 at 15:49
  • Can't you push back on that since they're presumably asking you to handle CSV input but then sending you something that's not CSV by any standard definition? – Ed Morton Aug 05 '22 at 15:51
  • If you do manage to get real CSV input then see [whats-the-most-robust-way-to-efficiently-parse-csv-using-awk](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk) for how it can be handled with awk. If you really want to try to take on the challenge in your question, though, then make sure to include a line like `this | "is\\" | tricky\\\"` in your sample input to make sure an escaped backslash before a closing `"` is handled correctly as, presumably, ```\\``` followed by `"` rather than ```\``` followed by `\"`, etc.. Good luck! – Ed Morton Aug 05 '22 at 16:18
  • Also include `this | "is\" | also" | tricky` in your sample input (assuming that's how you'd include a quote plus a pipe symbol in a quoted string, to catch another edge case (I'm sure there's more....) – Ed Morton Aug 05 '22 at 17:27
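As Ed Morton's comment suggests, if the input conformed to RFC 4180-style quoting (with `|` kept as the delimiter), the question's script would work unchanged. A minimal sketch, assuming a hypothetical conforming file (the file name `conforming.txt` and its contents are made up for illustration):

```shell
# Hypothetical RFC-4180-style version of the data: fields containing
# the delimiter are fully quoted, and there are no stray \" escapes.
cat > conforming.txt <<'EOF'
"first | last | name"|"john | white"|age|52
school|"ABC | USA"|year|2016
EOF

# The question's script, unchanged: with FS="\"", the even-numbered
# fields are the quoted ones, so only pipes inside quotes get escaped.
awk '
BEGIN { FS=OFS="\"" }
  { for (i=2;i<=NF;i+=2)
        gsub(/\|/,"\\|",$i)
    print
  }
' conforming.txt
```

With conforming input, the even/odd field split on `"` is unambiguous, so no special handling of escaped quotes is needed.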

2 Answers


Looking at your example set of input & output, the following awk solution may work for you:

awk '{
   p = $0
   # repeatedly match the next quoted string that is not preceded by a
   # backslash; (\\.[^"\\]*)* lets escaped characters appear inside it
   while (match(p, /(^|[^\\])"[^"\\]*(\\.[^"\\]*)*"/)) {
      m = substr(p,RSTART+1,RLENGTH-1)   # matched text minus its first char
      gsub(/\|/, "\\|", m)               # escape pipes inside the quotes
      buf = buf substr(p,1,RSTART) m     # accumulate the processed part
      p = substr(p,RSTART+RLENGTH)       # continue after the match
   }
   $0 = buf p
   buf = ""
} 1' file

"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" | 1989\"
company |" Pvt \| ltd "| joining | 2019

An alternative one-liner solution using perl:

perl -pe 's/(?<!\\)"[^"\\]*(?:\\.[^"\\]*)*"/$&=~s~\|~\\|~gr/ge' file

"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" | 1989\"
company |" Pvt \| ltd "| joining | 2019

RegEx Demo of regex used above

anubhava

One common approach is to replace `\"` with a nonsensical string, perform the normal operation, and then replace the nonsensical string back with `\"`. The nonsensical string must be something that does not appear in the original input, and it cannot include characters that awk treats specially (e.g., `&` in string replacement functions).

One awk idea:

awk '
BEGIN { FS=OFS="\"" }
      { gsub(/\\"/,"@@%%",$0)             # replace \" with @@%% then continue with original code
        for (i=2;i<=NF;i+=2)
            gsub(/\|/,"\\|",$i)
        gsub(/@@%%/,"\\\"")               # replace @@%% with \"
        print
      }
' testfile.txt

This generates:

"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" | 1989\"
company |" Pvt \| ltd "| joining | 2019

NOTES: as mentioned in comments ...

  • life would be much easier if the input used a common/standard format
  • potential solutions get messier as the input formats get messier (eg, this answer will have issues properly processing something like "is\\")
markp-fuso
  • The value of `RS` is the common "nonsensical string" to use for things like this since it's guaranteed to not be present in RS-separated input and the default value of `\n` is easy to work with as it's just 1 char and so can be negated in a bracket expression if/when necessary. – Ed Morton Aug 05 '22 at 17:06
  • @markp-fuso : if u wanna go the "non-sensical bytes" route, something like `\301\30\1` would be close to absolutely bullet proof, even for most binary files - just avoid `\0` or `\377` - those appear in large chunks in binary files all the time. if u don't wanna use `ASCII control` bytes, then maybe `\371\300` or even `\556\666\776` or `\531\642\753` - or just any combination that thoroughly violates `UTF-8` rules. `gawk` unicode mode is fine with those as long as you don't run it through the `length()` function, or the `match()` function with a regex of anything except `match( str, /$/)`) – RARE Kpop Manifesto Aug 05 '22 at 20:19
  • ….. but `g/sub()/split()` functions are cool with these type of non-sensical seps. `ASCII` control bytes are still fully valid `UTF-8`, so i use that route only very occasionally. The `\ 3 [067] [0-7]` space has all 13 `UTF-8-invalid` bytes. for u to pick. ps : `match( str, /$/ ) - 1` is a trick i found that could get `gawk` unicode mode to spit out the proper byte-count of ANY input, including full blown binary data, and not complain at all – RARE Kpop Manifesto Aug 05 '22 at 20:28
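Ed Morton's `RS` suggestion in the comments above can be sketched as a variant of this answer. This is only a sketch verified against the question's sample; the `\\"` edge cases mentioned earlier are untested:

```shell
cat > testfile.txt <<'EOF'
"first | last | name" |" john | white "| age | 52
school |" ABC | USA "| year | 2016
Home | Road | year\" | 1989\"
company |" Pvt | ltd "| joining | 2019
EOF

# RS defaults to "\n", which by definition cannot occur inside a
# record, so it is a safe single-character placeholder for \".
awk '
BEGIN { FS=OFS="\"" }
      { gsub(/\\"/, RS)                 # hide \" behind the placeholder
        for (i=2;i<=NF;i+=2)
            gsub(/\|/,"\\|",$i)
        gsub(RS, "\\\"")                # restore \"
        print
      }
' testfile.txt
```

Because the placeholder is a single character that can never appear in newline-separated records, there is no risk of colliding with real input data.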