
I'm trying to write a regex that captures both the HTTP status code and the body of a curl response. The pattern below matches on several online regex testers, but it won't match in a shell `if` statement on my Mac's command line. Is my regex off, or is something else going on?

RESPONSE=$(curl -s -i -X GET http://www.google.com/)

# Match and capture the status code, match the headers, match two new lines, match and capture an optional body
re="^HTTP\/\d\.\d\s([\d]{3})[\w\d\s\W\D\S]*[\r\n]{2}([\w\d\s\W\D\S]*)?$"

if [[ "${RESPONSE}" =~ $re ]]; then
  echo "match"
  # Now do stuff with the captured groups, "${BASH_REMATCH[...]}"
else
  echo "no match"
fi

I'm also open to other ways of doing this (I'm targeting a machine running CentOS 5).

Matthew Herbst
  • Write the body to a file and use the `-w` curl flag to have curl output just the status code to stdout? – Etan Reisner Apr 13 '16 at 11:25
  • Try using a basic regular expression like: `^HTTP/[0-9]\.[0-9] [0-9]{3} OK` Escaping periods and spaces is ok, but the other escape sequences are being interpreted literally. Don't think capturing the status code will work either. Might be better off with sed, awk, perl, etc. – Cole Tierney Apr 13 '16 at 12:02
  • @ColeTierney can you expand on that? Why would the other escape sequences such as `\d` or `\w` be interpreted literally? – Matthew Herbst Apr 13 '16 at 13:29
  • @EtanReisner that's how I currently do it, though I'm trying to give myself a scenario where I don't have write access to the system, thus this question – Matthew Herbst Apr 13 '16 at 13:40
  • I'm testing with bash 3.2.57(1)-release. Try the following 3 tests: `[[ " " =~ \s ]] && echo yes || echo no` (I get no), `[[ "\s" =~ \s ]] && echo yes || echo no` (I get yes), and `[[ " " =~ [[:blank:]] ]] && echo yes || echo no` (I get yes). – Cole Tierney Apr 13 '16 at 16:25
  • @ColeTierney I get the same - I have the exact same version of Bash (which makes sense if we both have fully-updated Macs) – Matthew Herbst Apr 13 '16 at 16:39
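
Building on the tests in the comments above: bash's `=~` uses POSIX extended regular expressions, which do not define `\d`, `\s` or `\w`, so whether those escapes work depends on the platform's regex library. Bracket expressions such as `[0-9]` are the portable choice. Below is a minimal sketch of the question's pattern rewritten that way. The names `parse_http_response`, `http_code` and `http_body` are hypothetical, and the sketch assumes the response uses CRLF line endings, as `curl -i` normally prints them.

```shell
#!/usr/bin/env bash
# Sketch only: parse_http_response, http_code and http_body are
# hypothetical names, not part of bash or curl.

parse_http_response() {
  local response=$1
  local cr=$'\r' nl=$'\n'

  # POSIX ERE: bracket expressions instead of \d, \s, \w.
  #   group 1: three-digit status code
  #   group 2: header lines (each non-empty and ending in CRLF)
  #   group 3: body, i.e. everything after the first blank line
  # The final LF is optional because $(...) strips trailing newlines,
  # which removes it when the body is empty.
  local re="^HTTP/[0-9.]+ ([0-9]{3})[^${cr}]*${cr}${nl}([^${cr}]+${cr}${nl})*${cr}${nl}?(.*)\$"

  # Keep the pattern unquoted on the right of =~; in bash 3.2 and
  # later a quoted pattern is compared as a literal string.
  if [[ $response =~ $re ]]; then
    http_code=${BASH_REMATCH[1]}
    http_body=${BASH_REMATCH[3]}
    return 0
  fi
  return 1
}
```

Usage would look something like `RESPONSE=$(curl -s -i -X GET http://www.google.com/)` followed by `parse_http_response "$RESPONSE" && echo "$http_code"`.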

2 Answers


Since you are open to other solutions, too, you can try this out.

RESPONSE=$(curl -s -i -X GET http://www.google.com/)

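# Quote "$RESPONSE" so the line breaks between headers and body survive.
HTTP_STATUS_CODE=$(printf '%s\n' "$RESPONSE" | sed '
  /^HTTP/ {
    s/^HTTP[^ ]* //
    s/ .*$//
    q
  }
  D')

BODY=$(printf '%s\n' "$RESPONSE" | sed '
  /^.$/ {
    :body
    n
    b body
  }
  D')

echo "$HTTP_STATUS_CODE"
echo "$BODY"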

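HTTP_STATUS_CODE is taken from the first line starting with HTTP. Everything up to and including the first space (the protocol version) is removed, and from what is left ('302 Found') everything from the first space to the end of the line is removed, leaving just the code.

BODY starts at the first line that consists of a single character; all lines before it are deleted with 'D'. In an HTTP response that is the blank line separating headers from body, which still holds a carriage return. From there every line is printed until the end of the input.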
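
The two steps described above (take the code from the status line, then keep everything after the blank line) can also be sketched with plain bash parameter expansion, without invoking sed. This is only an illustration under some assumptions: the response uses CRLF line endings, and `split_response`, `status_code` and `body` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: split_response, status_code and body are hypothetical
# names chosen for this example.

split_response() {
  local response=$1
  local cr=$'\r' nl=$'\n'
  local sep=$'\r\n\r\n'   # CRLF CRLF: end of the header block

  # First line of the response, with the CR (or a bare LF) cut off.
  local status_line=${response%%"$cr"*}
  status_line=${status_line%%"$nl"*}

  # Drop the protocol token ("HTTP/1.1 "), then keep the text up to
  # the next space ("302").
  local rest=${status_line#* }
  status_code=${rest%% *}

  # Body: everything after the first blank line, or empty if the
  # separator never occurs.
  case $response in
    *"$sep"*) body=${response#*"$sep"} ;;
    *)        body= ;;
  esac
}
```

It could be called as `split_response "$RESPONSE"`, after which `"$status_code"` and `"$body"` hold the two parts.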

  • I like the idea, though both parts are giving `sed: RE error: illegal byte sequence` in my OSX terminal. – Matthew Herbst Apr 13 '16 at 13:37
  • It works in my zsh and bash in OSX. I have no idea what the problem could be... I suggest to copy it to a text editor and show all non-printable chars. – derlarsschneider Apr 13 '16 at 13:46
  • I got the error fixed using the post linked by @neric in his answer. When running the above (with the fix, setting `LC_ALL=C` before each `sed`), the `HTTP_STATUS_CODE` comes through but the `BODY` seems to be empty. – Matthew Herbst Apr 14 '16 at 00:28

Same idea as @derlarsschneider, but slightly less complicated:

RESPONSE=$(curl -s -i -X GET http://www.google.com/)

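# Take the code from the status line only, so "HTTP" in the body cannot match.
CODE=$(printf '%s\n' "$RESPONSE" | sed -n '1s/^HTTP[^ ]* \([0-9]*\).*/\1/p')

# Heuristic: keeps everything after the last "GMT", so it assumes the final header ends with a GMT date.
BODY=$(printf '%s\n' "$RESPONSE" | tr '\n' ' ' | sed -n 's/.*GMT *\(.*\)/\1/p')

echo "$CODE"
echo "$BODY"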
neric
  • Exact same error as below: `sed: RE error: illegal byte sequence`. No idea why. I'm literally copy/pasting the above into terminal. – Matthew Herbst Apr 13 '16 at 16:37
  • Hmm, did you see this: http://stackoverflow.com/questions/19242275/re-error-illegal-byte-sequence-on-mac-os-x – neric Apr 13 '16 at 16:50
  • Interesting. Putting `LC_ALL=C` before the `sed` for the `CODE` lets the command work but the output is wrong. `LC_ALL=C` before `sed` for the BODY the command still fails with the same error. – Matthew Herbst Apr 14 '16 at 00:31
  • Alright, by using `export LC_CTYPE=C export LANG=C` I get both commands to work. Contents of `BODY` is correct, but contents of `CODE` is not. Seems to contain some random JS instead of the HTTP Status Code – Matthew Herbst Apr 14 '16 at 01:07
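
Following up on the `-w` suggestion in the comments on the question: the status code can be appended to stdout after the body, so nothing has to be written to disk. Here is a rough sketch, assuming curl is run without `-i` and with `-w '\n%{http_code}'` so that the code ends up on its own last line. `split_status_trailer`, `http_code` and `http_body` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: split_status_trailer, http_code and http_body are
# hypothetical names. The input is expected to be the output of
#   curl -s -w '\n%{http_code}' <url>
# i.e. the body, a newline, then the three-digit status code.

split_status_trailer() {
  local out=$1
  local nl=$'\n'

  # Last line: everything after the final newline.
  http_code=${out##*"$nl"}

  # Body: everything before that final newline.
  http_body=${out%"$nl"*}
}
```

For example, `OUT=$(curl -s -w '\n%{http_code}' http://www.google.com/)` and then `split_status_trailer "$OUT"`. Since headers are not in the output, there is no header block to skip.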