
I'm parsing a roughly 10GB log file, and need to feed it through sed to capture some output. The necessary capture segment based on what I would use in JavaScript is:

s/method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)""/"\1","\2","\3"/

Unfortunately sed (GNU sed 4.2.1, GnuWin32 edition) is struggling over the [^"]* ranges. It refuses to match them. I've tried variations of other acceptance blocks, with [a-zA-Z0-9:\\/.]* and similar variants but there seem to always be new characters inside the block that it misses, and really I can accept any valid character held between the quotes. With sed's * routine being a greedy implementation it tends to also have problems on the final "accept" item, pulling in all the other items on the log entry right up until the end.

I need to capture everything between the quotation marks and ignore the rest of the log entry.

I've been at this for two days for some stupid thing I could have implemented directly in python if there wasn't a requirement it be executed from a script with sed. Can any regex guru out there help?

EDIT:

For the extra information about examples, this produces no matches on my system, sed 4.2.1 from the GnuWin32.sourceforge.net collection: sed -r 's/method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)""/"\1","\2","\3"/' logfile

This produces matches for some entries: sed -r 's/^.*\method\=""([A-Z]*).*path=""([a-zA-Z0-9:\/]*).*accept=""(.*)"".*/"\1","\2","\3"/' logfile

Here are some (slightly redacted but not too much) lines:

"server-01/1.2.3.4    time=""Wed Oct 29 05:59:59 GMT+00:00 2014"" method=""GET"" path=""/ourapp/foo/bar/AAA-123:1029"" status=""200"" message=""OK"" duration=""7"" query=""cc=1463648"" content_type=""application/json"" referer=""https://example.org/somewhere"" from=""foo@bar.com"" ip=""1.2.3.4"" agent=""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"" req_header_accept=""application/json, text/javascript, application/sord+xml; q=0.01"" req_header_accept-language=""en-US,en;q=0.8"" req_header_x-request-id=""29/Oct/2014:05:59:59.968a-abc123ABC"" req_header_x-forward=""1.2.3.4"" req_header_x-forwarded-for=""1.2.3.4"" ","2014-10-28T23:59:59.000-0000","someapp-01.a",production,1,"/home/someapp/log/ourapp-access.log","ut01-splunkidx18.i"

"server-01/1.2.3.4    time=""Wed Oct 29 05:59:59 GMT+00:00 2014"" method=""GET"" path=""/ourapp/foo/bar:AA9.1/ABC-123/record"" status=""200"" message=""OK"" duration=""73"" query=""view=includeFields"" content_type=""application/json"" from=""None"" ip=""1.2.3.4"" req_header_accept=""application/json"" req_header_x-request-id=""ab123-abc123-12345abc"" req_header_x-forward=""1.2.3.4"" req_header_x-forwarded-for=""1.2.3.4"" ","2014-10-28T23:59:59.000-0000","someapp-01.a",production,1,"/home/someapp/log/ourapp-access.log","ut01-splunkidx18.i"

"server-01/1.2.3.4    time=""Wed Oct 29 05:59:59 GMT+00:00 2014"" method=""HEAD"" path=""/ourapp/foo/bar:AA3.4/ABC-123/meta"" status=""200"" message=""OK"" duration=""21"" content_type=""application/json"" from=""foo@bar.com"" ip=""1.2.3.4"" agent=""Java/1.7.0_25"" req_header_accept=""application/json"" req_header_accept-language=""en"" req_header_cache-control=""no-cache"" req_header_x-request-id=""29/Oct/2014:05:59:59.882va-af527A"" req_header_x-forward=""1.2.3.4"" req_header_x-forwarded-for=""1.2.3.4"" ","2014-10-28T23:59:59.000-0000","someapp-01.a",production,1,"/home/someapp/log/ourapp-access.log","ut01-splunkidx18.i"
Bryan
  • `[^"]*` works fine in GNU `sed`. Please show the _complete_ command you are using to invoke `sed`. (For best results, also show some small sample input and the corresponding desired output.) – John1024 Nov 10 '14 at 22:09
  • Info added to original post, too long for comment. – Bryan Nov 10 '14 at 22:56

1 Answer


The key to this problem turned out to be how the Windows shell interacts with the sed command. See the last section of this answer for details.

Demonstration under a Unix shell

As sample input consider:

$ cat file
some method=""this is my method"" more stuff path=""My Path""  accept=""Yes"" end of line

The following sed command processes that input:

$ sed -r 's/.*method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)"".*/"\1","\2","\3"/' file
"this is my method","My Path","Yes"

Note that the -r option is required so that unescaped parentheses act as grouping operators rather than literal characters.

Using the more complex input in the revised question:

$ sed -r 's/.*method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)"".*/"\1","\2","\3"/' input
"GET","/ourapp/foo/bar/AAA-123:1029","application/json, text/javascript, application/sord+xml; q=0.01"

"GET","/ourapp/foo/bar:AA9.1/ABC-123/record","application/json"

"HEAD","/ourapp/foo/bar:AA3.4/ABC-123/meta","application/json"

As regards the accept issue, I see two accept variables in the sample input:

req_header_accept
req_header_accept-language

Because the regex requires accept= to be followed immediately by "", only the former can match. In the latter, accept is followed by -language.
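A quick way to confirm this, as a sketch: the shortened line below is made up for illustration rather than taken from the log, and -E is used as GNU sed's synonym for -r.

```shell
# Hypothetical, shortened log line containing both accept-style fields.
line='ip=""1.2.3.4"" req_header_accept=""application/json"" req_header_accept-language=""en"" tail'

# accept="" occurs only in req_header_accept: in the other field, "accept"
# is followed by "-language" rather than by the doubled quotes.
accepted=$(printf '%s\n' "$line" | sed -E 's/.*accept=""([^"]*)"".*/\1/')
printf '%s\n' "$accepted"
```

This should print application/json, not en.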

Matching non-quotes

Consider the input:

$ cat test.txt
Billy "The Kid" Smith
Jimmy "The Fish" Stuart
Chuck "The Man" Norris

This sed command selects the quoted material:

$ sed -r 's/.*"([^"]*)".*/\1/' test.txt
The Kid
The Fish
The Man

All of these tests were run with GNU sed version 4.2.1 under Linux.
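The same idea explains the greedy-match problem described in the question. A small sketch, using a made-up line and -E (GNU sed's synonym for -r), contrasts a (.*) capture with a [^"]* capture:

```shell
# Hypothetical line: the wanted field is followed by another quoted field.
line='accept=""application/json"" lang=""en"" trailer'

# Greedy (.*) extends to the last "" on the line and drags in later fields.
greedy=$(printf '%s\n' "$line" | sed -E 's/.*accept=""(.*)"".*/\1/')

# [^"]* cannot cross a quote, so the capture stops where the field ends.
bounded=$(printf '%s\n' "$line" | sed -E 's/.*accept=""([^"]*)"".*/\1/')

printf '%s\n%s\n' "$greedy" "$bounded"
```

The greedy version captures text from the following field as well, while the bounded version captures only application/json.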

Windows Shell Issues

The following are key points for making sed commands work on Windows:

  • Enclose sed commands in double quotes. Under the Windows shell, commands should be protected by double-quotes, not single quotes as Unix uses.

  • If a string needs to contain double-quotes, write them in hexadecimal coding as \x22.

  • Under Windows, an unquoted caret ^ is an escape character. This, however, does not affect us because, in our case, the ^ always appears inside a double-quoted string.

  • CygWin, if it is available, avoids Windows shell issues.

Thus, for the Billy The Kid input, try:

sed -r "s/.*\x22([^\x22]*)\x22.*/\1/" test.txt

Also, ^ is a Windows escape character but it reportedly only functions as such outside quotes. Thus, I left it as is in the above command.
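One further option, offered only as a sketch and not tested under the Windows shell here: sed's -f option reads the program from a file, so the quotes in the regex never pass through any shell. The file names below are illustrative, and -E is GNU sed's synonym for -r. Under a Unix shell:

```shell
# Keep the sed program in a file so no shell has to parse its quotes.
# The names extract.sed and test.txt are illustrative only.
workdir=$(mktemp -d)
cat > "$workdir/extract.sed" <<'EOF'
s/.*"([^"]*)".*/\1/
EOF

printf '%s\n' 'Billy "The Kid" Smith' > "$workdir/test.txt"

# -f reads the program from the file; only plain file names reach the shell.
nickname=$(sed -E -f "$workdir/extract.sed" "$workdir/test.txt")
printf '%s\n' "$nickname"
rm -r "$workdir"
```

On Windows the equivalent call would presumably be sed -r -f extract.sed test.txt, with the program file written in a text editor.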

For the full case, Bryan reports that the following works:

sed -r "s/^.*method\=\x22\x22([^\x22]*).*path=\x22\x22([^\x22]*).*req_header_accept=\x22\x22([^\x22]*).*$/\x22\1\x22,\x22\2\x22,\x22\3\x22/" logfile
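The \x22 form can also be checked under a Unix shell, since GNU sed treats \x22 as a hex escape for the double-quote character. The sketch below assumes GNU sed and uses a shortened version of the first sample line from the question. Single quotes appear only because this runs under a Unix shell rather than the Windows one, and -E is GNU sed's synonym for -r.

```shell
# Shortened from the first sample line in the question.
line='x method=""GET"" path=""/ourapp/foo/bar/AAA-123:1029"" status=""200"" req_header_accept=""application/json"" req_header_accept-language=""en"" '

# Every literal double quote in the program is written as \x22 (GNU sed).
csv=$(printf '%s\n' "$line" | sed -E 's/^.*method=\x22\x22([^\x22]*).*path=\x22\x22([^\x22]*).*req_header_accept=\x22\x22([^\x22]*).*$/\x22\1\x22,\x22\2\x22,\x22\3\x22/')
printf '%s\n' "$csv"
```

With GNU sed this should print "GET","/ourapp/foo/bar/AAA-123:1029","application/json".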
John1024
  • Thanks. I knew about the -r. Unfortunately it is the `[^"]*` that sed seems to be choking on. I spent several hours trying to track this down and could not find a solution. For some reason it absolutely refuses to work with a "not quote" set. With three test lines (Billy "The Kid" Smith; Jimmy "The Fish" Stuart; Chuck "The Man" Norris), this works, extracting the text inside the quotes: `sed -r 's/.*"([a-zA-Z ]*)".*/\1/' test.txt` but this gives no results: `sed -r 's/.*"([^"]*)".*/\1/' test.txt` – Bryan Nov 10 '14 at 23:07
  • Your `sed` command for the Billy the Kid input works for me (output in updated answer). I am using GNU sed on Linux but `[^"]*` really should work in any `sed`, GNU or not. I am not familiar with Windows. Could there be some interaction with the Windows shell? – John1024 Nov 10 '14 at 23:13
  • @Bryan According to [this answer](http://stackoverflow.com/a/3331793/3030305h), there is a Windows shell problem. For one, `"` takes the role of `'` and, two, `^` is a Windows escape character. – John1024 Nov 10 '14 at 23:21
  • Thanks, that link is enough to get this stupid thing working. – Bryan Nov 11 '14 at 18:47
  • Woah, that's quite an edit to the answer. I'll mark the heavily rewritten answer as useful, but it was [this link](http://stackoverflow.com/a/3331793/3030305h) you provided that unlocked it, not the rewrite you provided in the answer. – Bryan Nov 11 '14 at 18:49
  • `sed -r "s/^.*\method\=\x22\x22([^\x22]*).*path=\x22\x22([^\x22]*).*req_header_accept=\x22\x22([^\x22]*).*$/\x22\1\x22,\x22\2\x22,\x22\3\x22/" logfile` – Bryan Nov 11 '14 at 19:01
  • OK. So `\x22` replaces `"`. What is the purpose of escaping `m`? `m` is not special in `sed`. By the way, because `sed` regexes are greedy, the initial `^` (start of line) and the final `$` (end of line) should not be necessary. – John1024 Nov 11 '14 at 19:04
  • The \m is because I was editing and revising the stupid string for many hours. At one point it was probably attached to something that did need to be escaped. Where it is now doesn't hurt anything. Thankfully it just churned for about an hour and gave the needed results. – Bryan Nov 12 '14 at 22:14
  • @Bryan Very good. I am pleased that you found something that works. – John1024 Nov 12 '14 at 22:39