I'm parsing a roughly 10 GB log file and need to feed it through sed to extract a few fields. Based on what I would write in JavaScript, the substitution I need is:
s/method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)""/"\1","\2","\3"/
Unfortunately, sed (GNU sed 4.2.1, GnuWin32 edition) struggles with the [^"]* groups and refuses to match them. I've tried other character classes, such as [a-zA-Z0-9:\\/.]* and similar variants, but there always seem to be new characters inside the quotes that the class misses, and really I want to accept any character held between the quotes. Because sed's * is greedy, using .* instead also causes problems on the final "accept" item: it pulls in all the other items on the log entry right up to the end of the line.
I need to capture everything between the quotation marks and ignore the rest of the log entry.
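To illustrate the greedy problem described above, here is a minimal sketch. It assumes a POSIX shell and GNU sed, where single quotes protect the script; quoting behaves differently under cmd.exe. The sample line is made up and much shorter than a real entry.

```shell
# Hypothetical sample, shortened from the real log format (doubled quotes kept).
line='method=""GET"" path=""/a/b"" accept=""application/json"" lang=""en"" '

# Greedy: (.*) extends to the LAST "" on the line, so it swallows the lang field too.
greedy=$(printf '%s\n' "$line" | sed -r 's/.*accept=""(.*)"".*/\1/')

# Bounded: ([^"]*) cannot cross a quote, so it stops at the end of the accept value.
bounded=$(printf '%s\n' "$line" | sed -r 's/.*accept=""([^"]*)"".*/\1/')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

The bounded form is the behaviour I'm after for all three fields.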
I've been at this for two days, over something I could have implemented directly in Python if there weren't a requirement that it be executed from a script with sed. Can any regex guru out there help?
EDIT:
For more detail on the examples: this produces no matches on my system (sed 4.2.1 from the GnuWin32.sourceforge.net collection):
sed -r 's/method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)""/"\1","\2","\3"/' logfile
This produces matches for some entries:
sed -r 's/^.*\method\=""([A-Z]*).*path=""([a-zA-Z0-9:\/]*).*accept=""(.*)"".*/"\1","\2","\3"/' logfile
Here are some (slightly redacted but not too much) lines:
"server-01/1.2.3.4 time=""Wed Oct 29 05:59:59 GMT+00:00 2014"" method=""GET"" path=""/ourapp/foo/bar/AAA-123:1029"" status=""200"" message=""OK"" duration=""7"" query=""cc=1463648"" content_type=""application/json"" referer=""https://example.org/somewhere"" from=""foo@bar.com"" ip=""1.2.3.4"" agent=""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"" req_header_accept=""application/json, text/javascript, application/sord+xml; q=0.01"" req_header_accept-language=""en-US,en;q=0.8"" req_header_x-request-id=""29/Oct/2014:05:59:59.968a-abc123ABC"" req_header_x-forward=""1.2.3.4"" req_header_x-forwarded-for=""1.2.3.4"" ","2014-10-28T23:59:59.000-0000","someapp-01.a",production,1,"/home/someapp/log/ourapp-access.log","ut01-splunkidx18.i"
"server-01/1.2.3.4 time=""Wed Oct 29 05:59:59 GMT+00:00 2014"" method=""GET"" path=""/ourapp/foo/bar:AA9.1/ABC-123/record"" status=""200"" message=""OK"" duration=""73"" query=""view=includeFields"" content_type=""application/json"" from=""None"" ip=""1.2.3.4"" req_header_accept=""application/json"" req_header_x-request-id=""ab123-abc123-12345abc"" req_header_x-forward=""1.2.3.4"" req_header_x-forwarded-for=""1.2.3.4"" ","2014-10-28T23:59:59.000-0000","someapp-01.a",production,1,"/home/someapp/log/ourapp-access.log","ut01-splunkidx18.i"
"server-01/1.2.3.4 time=""Wed Oct 29 05:59:59 GMT+00:00 2014"" method=""HEAD"" path=""/ourapp/foo/bar:AA3.4/ABC-123/meta"" status=""200"" message=""OK"" duration=""21"" content_type=""application/json"" from=""foo@bar.com"" ip=""1.2.3.4"" agent=""Java/1.7.0_25"" req_header_accept=""application/json"" req_header_accept-language=""en"" req_header_cache-control=""no-cache"" req_header_x-request-id=""29/Oct/2014:05:59:59.882va-af527A"" req_header_x-forward=""1.2.3.4"" req_header_x-forwarded-for=""1.2.3.4"" ","2014-10-28T23:59:59.000-0000","someapp-01.a",production,1,"/home/someapp/log/ourapp-access.log","ut01-splunkidx18.i"