sed: replace spaces within quotes with underscores

Question

I have input (for example, from ifconfig run0 scan on OpenBSD) that has some fields that are separated by spaces, but some of the fields themselves contain spaces (luckily, such fields that contain spaces are always enclosed in quotes).

I need to distinguish between the spaces within the quotes, and the separator spaces. The idea is to replace spaces within quotes with underscores.

Sample data:

%cat /tmp/ifconfig_scan | fgrep nwid | cut -f3
nwid Websense chan 6 bssid 00:22:7f:xx:xx:xx 59dB 54M short_preamble,short_slottime
nwid ZyXEL chan 8 bssid cc:5d:4e:xx:xx:xx 5dB 54M privacy,short_slottime
nwid "myTouch 4G Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M privacy,short_slottime

This doesn't get processed the way I want, because the spaces within the quotes haven't been replaced with underscores yet:

%cat /tmp/ifconfig_scan | fgrep nwid | cut -f3 |\
    cut -s -d ' ' -f 2,4,6,7,8 | sort -n -k4
"myTouch Hotspot" 11 bssid d8:b3:77:xx:xx:xx
ZyXEL 8 cc:5d:4e:xx:xx:xx 5dB 54M
Websense 6 00:22:7f:xx:xx:xx 59dB 54M

Try AWK, it might be your solution instead of sed. http://stackoverflow.com/questions/3458699/how-to-use-awk-to-extract-a-quoted-field — Ricardo Ortega Magaña, Feb 16 '13 at 23:10
Yes, I think I'll have to use `awk`. But I still want to replace spaces within the quotes with underscores, as part of the final processing. — cnst, Feb 16 '13 at 23:18
Check the SUB part of this: http://www.staff.science.uu.nl/~oostr102/docs/nawk/nawk_92.html you can mix both links i gave you to solve your problem. — Ricardo Ortega Magaña, Feb 16 '13 at 23:27
@cnst: Perl would be more appropriate than `awk` or `sed`. Much more scalable too. — Steve, Feb 17 '13 at 08:11

score 5 · Accepted Answer · answered Feb 17 '13 at 01:57

For a sed-only solution (which I don't necessarily advocate), try:

echo 'a b "c d e" f g "h i"' |\
sed ':a;s/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta'
a b "c_d_e" f g "h_i"

Translation:

Start at the beginning of the line.
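Look for the pattern junk"junk"junk, repeated zero or more times, where junk contains no quote, followed by junk"junk and a space. An even number of quotes has been passed before that last quote, so the space sits inside an open quoted string.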
Replace the final space with _.
If successful, jump back to the beginning.
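
To watch the loop at work without GNU sed's `l0`, the same substitution can be applied one pass at a time. This is only a sketch: the function names `underscore_quoted` and `one_step` are made up for illustration, and the input is the example line from this answer.

```shell
# The full loop, with the label on its own -e (the spelling used for
# non-GNU seds in the timings further down this page).
underscore_quoted() {
    sed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta'
}

# A single pass of the same substitution, with no loop.
one_step() {
    sed 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/'
}

echo 'a b "c d e" f g "h i"' | one_step
# a b "c d e" f g "h_i"
echo 'a b "c d e" f g "h i"' | one_step | one_step
# a b "c d_e" f g "h_i"
echo 'a b "c d e" f g "h i"' | underscore_quoted
# a b "c_d_e" f g "h_i"
```

Each pass replaces the last remaining space that lies inside quotes, which is why the replacement appears to work from the end of the line towards the beginning.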

Joseph Quinsey


It actually works! :-) Even with an old sed on OpenBSD 4.6 that had no `-E` option yet! But why do you have to escape the parentheses? (Although I tried replacing `\(` with `(`, and it stops working.) Also, why don't you have to include a space in the second `[]`, e.g. not `"[^" ]*"` instead of `"[^"]*"`? How does it know not to be greedy? Other than that, the regular expression itself makes perfect sense! :) So, `:a` is label `a`, and `ta` is jump to `a`? And jumping means rewinding the line to which search/replace was applied? Nifty! I'll have to put that into my arsenal. :-) – cnst Feb 17 '13 at 04:22
@cnst the replacement works backwards. To see the individual steps (GNU sed) place the command `l0` following the substitution command, i.e. `:a;s/.../.../;l0;ta` – potong Feb 17 '13 at 13:32
@potong, great other option, `l0` doesn't work in my `sed`, but just an `l`, as in `;l;ta`, seems to work great, indeed showing that it's processed greedy and backwards. Would it be better, in such case, to instead also avoid space as to make it not greedy? – cnst Feb 17 '13 at 17:50

score 4 · Answer 2 · edited Feb 17 '13 at 10:45

try this:

awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\"" file

it works for multi quotation parts in a line:

echo '"first part" foo "2nd part" bar "the 3rd part comes" baz'| awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\"" 
"first_part" foo "2nd_part" bar "the_3rd_part_comes" baz

EDIT alternative form:

awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' file
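
Why only the even-numbered fields are touched becomes clearer when the fields are printed with their index. The following is a small sketch, not part of the answer itself; `show_fields` is a hypothetical helper name and the input is shortened from the question's sample data.

```shell
# Split each line on double quotes and print every field with its index.
show_fields() {
    awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i }'
}

echo 'nwid "myTouch 4G Hotspot" chan 11' | show_fields
# 1:[nwid ]
# 2:[myTouch 4G Hotspot]
# 3:[ chan 11]
```

With `"` as the field separator, the text inside quotes always lands in fields 2, 4, 6 and so on. Assigning to a field makes awk rebuild the line using OFS, which is why both forms above also set OFS to `"`.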

edited Feb 17 '13 at 10:45

Ed Morton


answered Feb 16 '13 at 23:35

Kent


hm, doesn't work in my tcsh: `cat /tmp/ifconfig_scan | fgrep nwid | cut -f3 | awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\"" | cut -s -d ' ' -f 2,4,6,7,8 | sort -n -k4` returns `Unmatched ".` – cnst Feb 16 '13 at 23:40
OK, this works great in tcsh (just changed some " to '): `cat /tmp/ifconfig_scan | fgrep nwid | cut -f3 | awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS='"' | cut -s -d ' ' -f 2,4,6,7,8 | sort -n -k4` – cnst Feb 16 '13 at 23:45
I disagree that `awk`/`sed` are appropriate for this task, but that doesn't mean it can't be done. If you were going to use `awk`, you could do away with the `if` statement. Just use `i+=2` and `i – Steve Feb 17 '13 at 08:08
+1 for the approach but change `i++` to `i+=2` and get rid of `if(i%2==0)` and the spurious trailing `;` after the `gsub()`. Also, if you want the FS and OFS to have the same values, it's clearest to assign them both the same value in the BEGIN section as `BEGIN{FS=OFS="\""}`. – Ed Morton Feb 17 '13 at 10:41

Scrutinizer · Answer 3 · 2013-02-21T06:38:36.250

Another awk to try:

awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\"

Removing the quotes:

awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=
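
As a rough sketch of how the quote-removing form could slot into the pipeline from the question: with `RS` set to `"`, every record is the text between two quotes, so even-numbered records are the quoted strings. The function name `join_quoted` is an assumption for illustration, and the input line is taken from the question's sample data.

```shell
# Quote-removing variant: underscores inside quoted strings, quotes dropped.
join_quoted() {
    awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=
}

printf '%s\n' 'nwid "myTouch 4G Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M' |
    join_quoted | cut -s -d ' ' -f 2,4,6,7,8
# myTouch_4G_Hotspot 11 d8:b3:77:xx:xx:xx 49dB 54M
```

Once the quoted name is a single space-free token, `cut` picks the same columns for it as for the unquoted names.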

Some additional testing with a triple-size test file, further to the earlier tests done by @steve. I had to transform the sed statement a little so that non-GNU seds could process it as well. I included awk (bwk), gawk3, gawk4 and mawk:

$ for i in {1..1500000}; do echo 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' ; done > test
$ time perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge' test >/dev/null

real    0m27.802s
user    0m27.588s
sys 0m0.177s
$ time awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m6.565s
user    0m6.500s
sys 0m0.059s
$ time gawk3 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m21.486s
user    0m18.326s
sys 0m2.658s
$ time gawk4 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m14.270s
user    0m14.173s
sys 0m0.083s
$ time mawk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m4.251s
user    0m4.193s
sys 0m0.053s
$ time awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m13.229s
user    0m13.141s
sys 0m0.075s
$ time gawk3 '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m33.965s
user    0m26.822s
sys 0m7.108s
$ time gawk4 '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m15.437s
user    0m15.328s
sys 0m0.087s
$ time mawk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m4.002s
user    0m3.948s
sys 0m0.051s
$ time sed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test > /dev/null

real    5m14.008s
user    5m13.082s
sys 0m0.580s
$ time gsed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test > /dev/null

real    4m11.026s
user    4m10.318s
sys 0m0.463s

mawk rendered the fastest results...

nice one! both work great, and seem to be the shortest solution to the question, even shorter than the shortest `perl` snippet by @Steve (although less readable at that). i need to drop `sed`, and learn `awk`! — cnst, Feb 18 '13 at 21:00

Steve · Answer 4 · 2013-02-19T03:11:10.180

You'd be better off with perl. The code is much more readable and maintainable:

perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge'

With your input, the results are:

a b "c_d_e" f g "h_i"

Explanation:

-p            # enable printing
-e            # the following expression...

s             # begin a substitution

:             # the first substitution delimiter

"[^"]*"      # match a double quote followed by anything not a double quote any
              # number of times followed by a double quote

:             # the second substitution delimiter

($x=$&)=~s/ /_/g;      # copy the pattern match ($&) into a variable ($x), then
                       # replace every space in $x with an underscore. The
                       # variable $x is needed because capture groups and
                       # match variables such as $& are read-only.

$x            # return $x as the replacement.

:             # the last delimiter

g             # apply the outer substitution to every quoted string on the line
e             # evaluate the replacement as Perl code rather than as a string

Some testing:

for i in {1..500000}; do echo 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' >> test; done

time perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge' test >/dev/null

real    0m8.301s
user    0m8.273s
sys     0m0.020s

time awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m4.967s
user    0m4.924s
sys     0m0.036s

time awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m4.336s
user    0m4.244s
sys     0m0.056s

time sed ':a;s/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test >/dev/null

real    2m26.101s
user    2m25.925s
sys     0m0.100s

edited Feb 19 '13 at 03:11

answered Feb 17 '13 at 08:00

Steve


I'm sorry but I have to disagree on that code being more readable than anything. Others will obviously disagree but it's completely incomprehensible to me at least and I have honestly tried to figure it out. Would you mind adding an explanation of what it's doing? – Ed Morton Feb 17 '13 at 10:48
Thanks that helps a lot. The only things I still don't understand are - 1) what does "=~" mean as opposed to just "=", 2) what does "make sure that the replacement is handled as an expression" mean?, and 3) when you say "return $x as the replacement" - return it to what (is that like assigning to $0 in awk)? – Ed Morton Feb 17 '13 at 12:07
@EdMorton: No problem. Glad I could help. 1) `=~` just means "run this variable against this regular expression". 2) Perl's `e` flag makes Perl evaluate the replacement as code. In the parent substitution, the replacement value is a second (child) substitution. By default, Perl doesn't expect this, so the `e` flag is needed. – Steve Feb 17 '13 at 12:12
@EdMorton: 3) I mean, let the parent replacement _be_ `$x`. – Steve Feb 17 '13 at 12:17
this works great, and seems like the highest-level solution, matching the concept I had in mind: find the expression within quotes, then do a search/replace within that expression. Would you mind elaborating a little bit more on how it works underneath: how does it ensure that it'll match odd quotes as opening, and even as closing, and never otherwise? Is it based on a concept that the whole input right from the start has to go through the regexp, and it makes exactly one pass? For input of about 20 lines, would this `perl` still be more efficient than `sed` or `awk`? :-) – cnst Feb 17 '13 at 18:34
@cnst for input of 20 lines any solution you choose will run in the blink of an eye so I wouldn't worry about that. I wouldn't assume any of the posted solutions is more efficient than any other though as they all look pretty similar in complexity and number of operations to me - they're all searching each line for a regexp and substituting underscores for spaces when they find a matching string. – Ed Morton Feb 17 '13 at 22:41
@cnst: It simply finds a double quote, then anything not a double quote and then a double quote. It works from left to right, translating spaces to underscores as it goes. I've added some testing. Generally speaking, more complex regular expressions take longer to process. As expected, the `sed` regex takes the most time to process. HTH. – Steve Feb 19 '13 at 03:08
@EdMorton: Posted some interesting timings if you're interested. However, I think I still prefer the `perl`. I think it better describes what's actually happening. But in a time-critical pipeline, and after seeing these results, I'd trade off the readability and go with the `awk`. – Steve Feb 19 '13 at 03:16
@Steve, very nice job at testing! indeed it's not surprising that sed is the slowest, considering that it goes over each line several times; however, it is a little strange that it's just so slow compared to other options, perhaps some optimisation in the expression is possible? what if you make it non-greedy as per my suggestion in that answer? anyhow, as per overall, i had another test in mind: for input that's only 80 characters per line, by 20 lines long, which one would be the fastest? :) – cnst Feb 19 '13 at 06:36
@cnst: I think I'll let you continue testing... Generally, you'll need a large amount of input. Use at least 20 MB of data. Good luck. – Steve Feb 19 '13 at 06:53
@Steve, yeah, what i was thinking is several thousand runs over small amount of input. although results might highly depend on the operating system and specific perl / sed / awk versions etc. my wild guess, prior to your large-input testing, was that perl, being a fully-featured language, would have been slower on small input than sed, but i guess that might not be the case at all! – cnst Feb 19 '13 at 06:59
@Steve thanks for running the test and posting the result. I'm surprised perl is twice as slow as awk and that sed is SO much slower. Oh well. I find the awk solution clearest so I'd use that anyway. By the way, the 2nd awk solution you posted would add a trailing newline and " at the end of the input so it's not quite a solution. – Ed Morton Feb 19 '13 at 14:15
@Steve, thanks for the tests! Out of interest would you happen to have `mawk` at your disposal and could you retest the awk suggestions using `mawk`? – Scrutinizer Feb 20 '13 at 06:53
@Scrutinizer: Sorry mate. I don't have `mawk` on any of my machines. I've included instructions as to how I created my test file above. The results were a little variable with half a million lines, so you may want to at least triple that number in your testing. Please let me know how it goes though. I suspect that `mawk` will be faster still. Good luck. – Steve Feb 20 '13 at 07:08
OK, I compiled mawk, gawk3, gawk4 and GNU sed and added them to my system and then ran some further tests. I added the results to the end of my post. – Scrutinizer Feb 20 '13 at 14:44

score 1 · Answer 5 · answered Feb 17 '13 at 12:41

NOT AN ANSWER, just posting awk equivalent code for @steve's perl code in case anyone's interested (and to help me remember this in future):

@steve posted:

perl -pe 's:"[^\"]*":($x=$&)=~s/ /_/g;$x:ge'

and from reading @steve's explanation the briefest awk equivalent to that perl code (NOT the preferred awk solution - see @Kent's answer for that) would be the GNU awk:

gawk '{
   head = ""
   while ( match($0,"\"[^\"]*\"") ) {
      head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH))
      $0 = substr($0,RSTART+RLENGTH)
   }
   print head $0
}'

which we get to by starting from a POSIX awk solution with more variables:

awk '{
   head = ""
   tail = $0
   while ( match(tail,"\"[^\"]*\"") ) {
      x = substr(tail,RSTART,RLENGTH)
      gsub(/ /,"_",x)
      head = head substr(tail,1,RSTART-1) x
      tail = substr(tail,RSTART+RLENGTH)
   }
   print head tail
}'

and saving a line with GNU awk's gensub():

gawk '{
   head = ""
   tail = $0
   while ( match(tail,"\"[^\"]*\"") ) {
      x = gensub(/ /,"_","g",substr(tail,RSTART,RLENGTH))
      head = head substr(tail,1,RSTART-1) x
      tail = substr(tail,RSTART+RLENGTH)
   }
   print head tail
}'

and then getting rid of the variable x:

gawk '{
   head = ""
   tail = $0
   while ( match(tail,"\"[^\"]*\"") ) {
      head = head substr(tail,1,RSTART-1) gensub(/ /,"_","g",substr(tail,RSTART,RLENGTH))
      tail = substr(tail,RSTART+RLENGTH)
   }
   print head tail
}'

and then getting rid of the variable "tail" if you don't need $0, NF, etc, left hanging around after the loop:

gawk '{
   head = ""
   while ( match($0,"\"[^\"]*\"") ) {
      head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH))
      $0 = substr($0,RSTART+RLENGTH)
   }
   print head $0
}'
