-1

I have to handle some data that comes in pipe delimited files, where each field is enclosed in double quotes.

"Boolean"|"dada -sdf|xcvnb"|"123"

If I take FS="|", then the script takes the above as four fields, whereas this is actually three fields. If I take FS="\"|\"" then I have two issues:

  1. I have to deal with 1st and the last field that becomes, "Boolean and 123" separately
  2. And more importantly, Now since we don’t have double quotes anymore, when I take each field and process, some functions or commands may not take the whole string that is there in the field(since they may be separated by spaces and different other characters).e.g. the 2nd field becomes dada -sdf|xcvnb i.e. without quotes, which for some commands may give erroneous results as - may be interpreted as options, or only the 1st word may be taken as argument and rest of the string after space is not taken into consideration at all.

My thought - I want to tell gawk that take FS as | only if it is followed by a " and preceded by a ". That way I don't strip off the double quotes from the fields.

How can I write the code? Is there a way?

DPR
  • 25
  • 6
  • Awk isn't great with quote delimited fields, but gawk does have an advanced feature to deal with it. Take a look here: https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html – jas Sep 30 '16 at 10:43

3 Answers3

0

With gawk you can use the FPAT variable do define a pattern how a field looks like. In your case a field consists of a " followed by zero or more arbitrary characters except of the " and a final ".

Having this you can simply print the first field $1 and the last field $NF:

gawk '{print $1,$NF}' FPAT='"[^"]*"' OFS="|" \
    <<< '"Boolean"|"dada -sdf|xcvnb"|"123"'

Output:

"Boolean"|"123"
hek2mgl
  • 152,036
  • 28
  • 249
  • 266
  • Thank you for your valuable suggestion. So can I use FPAT option and raise an exception when there is a double quote ["] inside the field? How will it identify which one is the ending double quote. Consider this ("dada -sdf"|xcvnb") Which one will be taken as the data field "dada -sdf" or "dada -sdf"|xcvnb"? And can we use the FS option along with FPAT option? – DPR Sep 30 '16 at 11:34
  • You need to fix the input format. When fields are enclosed by quotes they must not contain the quote itself. If fields are separated by |, the | must not appear in the data. That's it. – hek2mgl Sep 30 '16 at 12:59
0

I want to tell gawk that take FS as | only if it is followed by a " and preceded by a "

You can do that exactly in perl with positive lookahead and lookbehind...

$ perl -F'/"\K\|(?=")/' -le 'print "$F[0] : $F[1] : $F[2]"' ip.txt 
"Boolean" : "dada -sdf|xcvnb" : "123"
  • -F to specify a field separator and save in @F array
    • also sets the -n flag which means iterate over input files in a loop, lines are NOT printed by default
    • To split on spaces, one can simply use -a option
  • '/"\K\|(?=")/' using regex to specify field separator. "\K is positive lookbehind and (?=") is positive lookahead. \| is the separator (| needs to be escaped)
  • -l strip newline from input lines and add newline to print statements
  • -e allows to pass perl code directly instead of from file
  • $F[0] array index starts with 0, this one refers to first element of @F array


Further reading:

Graham
  • 7,431
  • 18
  • 59
  • 84
Sundeep
  • 23,246
  • 2
  • 28
  • 103
  • Thank you. But I am using ksh for scripting. Can I use the perl command inside the ksh script? – DPR Sep 30 '16 at 11:35
  • yes, if you have perl installed.. example above works as command like awk/sed/etc... – Sundeep Sep 30 '16 at 11:43
  • Could you please explain your solution? I'm not familiar with perl syntax. Thank you!! – DPR Sep 30 '16 at 12:23
  • I have written a lot of code in gawk(big gawk blocks). Now looks like I have to change all of them to perl. Is there a small tweak that I can do to make the old scripts work? – DPR Sep 30 '16 at 12:40
  • there are similarities, but I think it'll have to re-written.. there is a [Awk to Perl translator](http://perldoc.perl.org/a2p.html) but if that doesn't work, perhaps you can workaround with `gawk` itself? – Sundeep Sep 30 '16 at 12:44
  • 1
    Thank you so much!! I will have to find a way in gawk. – DPR Sep 30 '16 at 12:58
0

I'd use a real CSV parser. I like ruby's:

ruby -rcsv -e '
  opts = { :col_sep => "|", :force_quotes => true }
  CSV.parse(STDIN.read, opts) do |row|
    row.delete_at(1)
    puts row.to_csv(opts)
  end
' filename

outputs

"Boolean"|"123"
glenn jackman
  • 238,783
  • 38
  • 220
  • 352