
I'm having trouble parsing some HTTP headers using C++. Right now I'd like to find the carriage return/linefeed combination that ends each HTTP header entry. I'm doing this with std::string::find() like so:

string hdr; //filled with the header data
int line_end_pos = hdr.find("\r\n"); //also tried "\\r\\n", same results

Despite knowing that the header has the combination of a carriage return and a linefeed character, find() keeps returning -1. What am I missing here?

EDIT:

The library I'm using offers a couple of different functions for displaying the data. A sample of the header data looks like this in string format:

GET /p/libcrafter/ HTTP/1.1
Host: code.google.com
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en,en-us;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
Cookie: PREF=ID=ad8fd3ab4b0bd3c9:U=e1bd88556eeb2dce:FF=0:TM=1382531357:LM=1382531841:S=Pbh-JiokGeVbsSh-; NID=67=olK2k5sUZ95mRApV77s7CfXscytJSfmVuyubiSCMotOdBBvijqrTwyyifLQZbZA_SCTVQXqTEoE6hqaqVJkRpqoY2RPDFBPghbe5czX6QxKw7lBdOaP6-IpzGXYMWl6Q; OGPC=4061029-5:; __utma=247248150.2068354019.1382532826.1382532826.1382532826.1; __utmb=247248150.10.10.1382532826; __utmc=247248150; __utmz=247248150.1382532826.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
Connection: keep-alive
Cache-Control: max-age=0

It looks like this in "Hex Dump" format:

  47455420 2F702F6C 69626372 61667465  GET /p/libcrafte 00000000
  722F2048 5454502F 312E310D 0A486F73  r/ HTTP/1.1..Hos 00000010
  743A2063 6F64652E 676F6F67 6C652E63  t: code.google.c 00000020
  6F6D0D0A 55736572 2D416765 6E743A20  om..User-Agent:  00000030
  4D6F7A69 6C6C612F 352E3020 28583131  Mozilla/5.0 (X11 00000040
  3B205562 756E7475 3B204C69 6E757820  ; Ubuntu; Linux  00000050
  7838365F 36343B20 72763A32 342E3029  x86_64; rv:24.0) 00000060
  20476563 6B6F2F32 30313030 31303120   Gecko/20100101  00000070
  46697265 666F782F 32342E30 0D0A4163  Firefox/24.0..Ac 00000080
  63657074 3A207465 78742F68 746D6C2C  cept: text/html, 00000090
  6170706C 69636174 696F6E2F 7868746D  application/xhtm 000000A0
  6C2B786D 6C2C6170 706C6963 6174696F  l+xml,applicatio 000000B0
  6E2F786D 6C3B713D 302E392C 2A2F2A3B  n/xml;q=0.9,*/*; 000000C0
  713D302E 380D0A41 63636570 742D4C61  q=0.8..Accept-La 000000D0
  6E677561 67653A20 656E2C65 6E2D7573  nguage: en,en-us 000000E0
  3B713D30 2E350D0A 41636365 70742D45  ;q=0.5..Accept-E 000000F0
  6E636F64 696E673A 20677A69 702C2064  ncoding: gzip, d 00000100
  65666C61 74650D0A 444E543A 20310D0A  eflate..DNT: 1.. 00000110
  436F6F6B 69653A20 50524546 3D49443D  Cookie: PREF=ID= 00000120
  61643866 64336162 34623062 64336339  ad8fd3ab4b0bd3c9 00000130
  3A553D65 31626438 38353536 65656232  :U=e1bd88556eeb2 00000140
  6463653A 46463D30 3A544D3D 31333832  dce:FF=0:TM=1382 00000150
  35333133 35373A4C 4D3D3133 38323533  531357:LM=138253 00000160
  31383431 3A533D50 62682D4A 696F6B47  1841:S=Pbh-JiokG 00000170
  65566273 53682D3B 204E4944 3D36373D  eVbsSh-; NID=67= 00000180
  6F6C4B32 6B357355 5A39356D 52417056  olK2k5sUZ95mRApV 00000190
  37377337 43665873 6379744A 53666D56  77s7CfXscytJSfmV 000001A0
  75797562 6953434D 6F744F64 42427669  uyubiSCMotOdBBvi 000001B0
  6A717254 77797969 664C515A 625A415F  jqrTwyyifLQZbZA_ 000001C0
  53435456 51587154 456F4536 68716171  SCTVQXqTEoE6hqaq 000001D0
  564A6B52 70716F59 32525044 46425067  VJkRpqoY2RPDFBPg 000001E0
  68626535 637A5836 51784B77 376C4264  hbe5czX6QxKw7lBd 000001F0
  4F615036 2D49707A 4758594D 576C3651  OaP6-IpzGXYMWl6Q 00000200
  3B204F47 50433D34 30363130 32392D35  ; OGPC=4061029-5 00000210
  3A3B205F 5F75746D 613D3234 37323438  :; __utma=247248 00000220
  3135302E 32303638 33353430 31392E31  150.2068354019.1 00000230
  33383235 33323832 362E3133 38323533  382532826.138253 00000240
  32383236 2E313338 32353332 3832362E  2826.1382532826. 00000250
  313B205F 5F75746D 623D3234 37323438  1; __utmb=247248 00000260
  3135302E 31302E31 302E3133 38323533  150.10.10.138253 00000270
  32383236 3B205F5F 75746D63 3D323437  2826; __utmc=247 00000280
  32343831 35303B20 5F5F7574 6D7A3D32  248150; __utmz=2 00000290
  34373234 38313530 2E313338 32353332  47248150.1382532 000002A0
  3832362E 312E312E 75746D63 73723D28  826.1.1.utmcsr=( 000002B0
  64697265 6374297C 75746D63 636E3D28  direct)|utmccn=( 000002C0
  64697265 6374297C 75746D63 6D643D28  direct)|utmcmd=( 000002D0
  6E6F6E65 290D0A43 6F6E6E65 6374696F  none)..Connectio 000002E0
  6E3A206B 6565702D 616C6976 650D0A43  n: keep-alive..C 000002F0
  61636865 2D436F6E 74726F6C 3A206D61  ache-Control: ma 00000300
  782D6167 653D300D 0A0D0A             x-age=0....      00000310

Finally, it looks like this as a "Raw String":

\x47\x45\x54\x20\x2f\x70\x2f\x6c\x69\x62\x63\x72\x61\x66\x74\x65\x72\x2f\x20\x48
\x54\x54\x50\x2f\x31\x2e\x31\xd\xa\x48\x6f\x73\x74\x3a\x20\x63\x6f\x64\x65\x2e\x67
\x6f\x6f\x67\x6c\x65\x2e\x63\x6f\x6d\xd\xa\x55\x73\x65\x72\x2d\x41\x67\x65\x6e\x74
\x3a\x20\x4d\x6f\x7a\x69\x6c\x6c\x61\x2f\x35\x2e\x30\x20\x28\x58\x31\x31\x3b\x20\x55
\x62\x75\x6e\x74\x75\x3b\x20\x4c\x69\x6e\x75\x78\x20\x78\x38\x36\x5f\x36\x34\x3b\x20
\x72\x76\x3a\x32\x34\x2e\x30\x29\x20\x47\x65\x63\x6b\x6f\x2f\x32\x30\x31\x30\x30\x31
\x30\x31\x20\x46\x69\x72\x65\x66\x6f\x78\x2f\x32\x34\x2e\x30\xd\xa\x41\x63\x63\x65\x70
\x74\x3a\x20\x74\x65\x78\x74\x2f\x68\x74\x6d\x6c\x2c\x61\x70\x70\x6c\x69\x63\x61\x74
\x69\x6f\x6e\x2f\x78\x68\x74\x6d\x6c\x2b\x78\x6d\x6c\x2c\x61\x70\x70\x6c\x69\x63\x61
\x74\x69\x6f\x6e\x2f\x78\x6d\x6c\x3b\x71\x3d\x30\x2e\x39\x2c\x2a\x2f\x2a\x3b\x71\x3d
\x30\x2e\x38\xd\xa\x41\x63\x63\x65\x70\x74\x2d\x4c\x61\x6e\x67\x75\x61\x67\x65\x3a\x20
\x65\x6e\x2c\x65\x6e\x2d\x75\x73\x3b\x71\x3d\x30\x2e\x35\xd\xa\x41\x63\x63\x65\x70\x74
\x2d\x45\x6e\x63\x6f\x64\x69\x6e\x67\x3a\x20\x67\x7a\x69\x70\x2c\x20\x64\x65\x66\x6c\x61
\x74\x65\xd\xa\x44\x4e\x54\x3a\x20\x31\xd\xa\x43\x6f\x6f\x6b\x69\x65\x3a\x20\x50\x52
\x45\x46\x3d\x49\x44\x3d\x61\x64\x38\x66\x64\x33\x61\x62\x34\x62\x30\x62\x64\x33\x63
\x39\x3a\x55\x3d\x65\x31\x62\x64\x38\x38\x35\x35\x36\x65\x65\x62\x32\x64\x63\x65\x3a
\x46\x46\x3d\x30\x3a\x54\x4d\x3d\x31\x33\x38\x32\x35\x33\x31\x33\x35\x37\x3a\x4c\x4d
\x3d\x31\x33\x38\x32\x35\x33\x31\x38\x34\x31\x3a\x53\x3d\x50\x62\x68\x2d\x4a\x69\x6f
\x6b\x47\x65\x56\x62\x73\x53\x68\x2d\x3b\x20\x4e\x49\x44\x3d\x36\x37\x3d\x6f\x6c\x4b
\x32\x6b\x35\x73\x55\x5a\x39\x35\x6d\x52\x41\x70\x56\x37\x37\x73\x37\x43\x66\x58\x73
\x63\x79\x74\x4a\x53\x66\x6d\x56\x75\x79\x75\x62\x69\x53\x43\x4d\x6f\x74\x4f\x64\x42
\x42\x76\x69\x6a\x71\x72\x54\x77\x79\x79\x69\x66\x4c\x51\x5a\x62\x5a\x41\x5f\x53\x43
\x54\x56\x51\x58\x71\x54\x45\x6f\x45\x36\x68\x71\x61\x71\x56\x4a\x6b\x52\x70\x71\x6f
\x59\x32\x52\x50\x44\x46\x42\x50\x67\x68\x62\x65\x35\x63\x7a\x58\x36\x51\x78\x4b\x77
\x37\x6c\x42\x64\x4f\x61\x50\x36\x2d\x49\x70\x7a\x47\x58\x59\x4d\x57\x6c\x36\x51\x3b
\x20\x4f\x47\x50\x43\x3d\x34\x30\x36\x31\x30\x32\x39\x2d\x35\x3a\x3b\x20\x5f\x5f\x75
\x74\x6d\x61\x3d\x32\x34\x37\x32\x34\x38\x31\x35\x30\x2e\x32\x30\x36\x38\x33\x35\x34
\x30\x31\x39\x2e\x31\x33\x38\x32\x35\x33\x32\x38\x32\x36\x2e\x31\x33\x38\x32\x35\x33
\x32\x38\x32\x36\x2e\x31\x33\x38\x32\x35\x33\x32\x38\x32\x36\x2e\x31\x3b\x20\x5f\x5f
\x75\x74\x6d\x62\x3d\x32\x34\x37\x32\x34\x38\x31\x35\x30\x2e\x31\x30\x2e\x31\x30\x2e
\x31\x33\x38\x32\x35\x33\x32\x38\x32\x36\x3b\x20\x5f\x5f\x75\x74\x6d\x63\x3d\x32\x34
\x37\x32\x34\x38\x31\x35\x30\x3b\x20\x5f\x5f\x75\x74\x6d\x7a\x3d\x32\x34\x37\x32\x34
\x38\x31\x35\x30\x2e\x31\x33\x38\x32\x35\x33\x32\x38\x32\x36\x2e\x31\x2e\x31\x2e\x75
\x74\x6d\x63\x73\x72\x3d\x28\x64\x69\x72\x65\x63\x74\x29\x7c\x75\x74\x6d\x63\x63\x6e
\x3d\x28\x64\x69\x72\x65\x63\x74\x29\x7c\x75\x74\x6d\x63\x6d\x64\x3d\x28\x6e\x6f\x6e
\x65\x29\xd\xa\x43\x6f\x6e\x6e\x65\x63\x74\x69\x6f\x6e\x3a\x20\x6b\x65\x65\x70\x2d\x61
\x6c\x69\x76\x65\xd\xa\x43\x61\x63\x68\x65\x2d\x43\x6f\x6e\x74\x72\x6f\x6c\x3a\x20\x6d
\x61\x78\x2d\x61\x67\x65\x3d\x30\xd\xa\xd\xa

As you can see, when the data is output in hex format the lines end with 0D and 0A, and in raw string format they end with \xd and \xa. My question remains, though: how can I find these end-of-line characters when working with the data as a string (or can't I)?

amoeba
  • Looks correct, have you verified (in a debugger, or log) that the string doesn't contain a \0 before the \r\n – Dweeberly Oct 23 '13 at 02:45
  • There shouldn't be. I tried adding it just to see with the call hdr.find("\0\r\n") but still got the same unexpected result. – amoeba Oct 23 '13 at 02:55
  • How did you initialize hdr? Were you using an input method that maps carriage-return/line-feed to a newline character? Also note that `\r\n` isn't entirely portable, though it is likely to work on most implementations. See http://stackoverflow.com/questions/1279779/what-is-the-difference-between-r-and-n/9549183#9549183 – Adrian McCarthy Oct 23 '13 at 03:11
  • It's initialized by a function that dumps a page's entire http content into a string. – amoeba Oct 23 '13 at 03:16
  • Something else is in that string terminating the search, as the code you posted will work with the string you provided [See It Live](http://ideone.com/DDVtcg). – WhozCraig Oct 23 '13 at 08:53
  • I suspect that the function that's reading the input into the string is stripping the carriage returns and leaving the line feeds. – Adrian McCarthy Oct 23 '13 at 21:55

1 Answer


The output of the following program is 35

#include <iostream>
#include <string>
using namespace std;

int main()
{
    string hdr = "Date: Wed, 23 Oct 2013 02:20:30 GMT\r\nServer: Apache\r\n"; 
    int line_end_pos = hdr.find("\r\n");
    cout << line_end_pos;
}

If we then modify this code, such that it is now:

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main()
{
    string hdr = "Date: Wed, 23 Oct 2013 02:20:30 GMT\r\nServer: Apache\r\n"; 

    int line_end_pos = hdr.find("\r\n");
    cout << line_end_pos;

    fstream output;
    output.open("test.txt", std::fstream::out);

    output << hdr;
    output.close();
}

We get a file with the contents of hdr. Viewing it with a hex editor shows that the data has been transformed on the way out. Between GMT and Server we expect two characters, 0x0D and 0x0A, but test.txt actually contains three: 0x0D, 0x0D, 0x0A. In text mode on Windows each 0x0A is written as 0x0D 0x0A, so the file is 55 bytes long even though the input string is 53 bytes long.

If we bitwise-or the flag std::fstream::binary with std::fstream::out,

output.open("test.txt", std::fstream::out | std::fstream::binary);

then the output is an identical copy of the string held in hdr, i.e. 53 bytes long, with a single 0x0D, 0x0A pair between lines.
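To check this without a hex editor, a round trip through a binary-mode file can be compared against the original string. This is only a rough sketch: write_binary and read_binary are names I made up for illustration, and the file name is arbitrary.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: writes `data` to `path` without newline translation.
bool write_binary(const std::string& path, const std::string& data)
{
    std::ofstream out(path, std::ios::out | std::ios::binary);
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    out.flush(); // push buffered bytes out so errors show up in the state
    return out.good();
}

// Hypothetical helper: reads the whole file back, again without translation.
std::string read_binary(const std::string& path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

If read_binary("test.txt") compares equal to hdr after write_binary("test.txt", hdr), no end-of-line translation took place.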

EDIT: Also, it's worth pointing out that Unix and Windows-based systems have different end-of-line conventions. I wrote this code under Windows.

So, I suggest you save a copy of the header and examine it with a hex editor. Until you do that or use a debugger, you won't be in a position to know what the problem is. I generally find that treating text input as binary input is the safest approach, since there's no translation of end-of-line characters.
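If a hex editor isn't handy, the string can also be inspected from inside the program. The sketch below is just one way to do it; to_hex and count_byte are hypothetical helpers, not part of any library you are using.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: renders every byte of `s` as two hex digits,
// separated by spaces, so control characters become visible.
std::string to_hex(const std::string& s)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// Hypothetical helper: counts how often byte `c` occurs in `s`.
std::size_t count_byte(const std::string& s, char c)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == c) ++n;
    }
    return n;
}
```

If count_byte(hdr, '\r') is 0 while count_byte(hdr, '\n') is not, the carriage returns were lost before the data reached the string. A non-zero count_byte(hdr, '\0') would point to an embedded NUL instead.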

EDIT 2: Do you get a result of 27 when you run this (the first 0x0D in the data below sits at index 27)? If so, I'm afraid I'm out of ideas just now. I'll consider your question further when I'm fresh, in the morning.

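#include <iostream>
#include <string>

using namespace std;

int main()
{
    // First 48 bytes of the request, copied from the hex dump in the question.
    const char rawData[] =
    {
        0x47,0x45,0x54,0x20, 0x2F,0x70,0x2F,0x6C, 0x69,0x62,0x63,0x72, 0x61,0x66,0x74,0x65,
        0x72,0x2F,0x20,0x48, 0x54,0x54,0x50,0x2F, 0x31,0x2E,0x31,0x0D, 0x0A,0x48,0x6F,0x73,
        0x74,0x3A,0x20,0x63, 0x6F,0x64,0x65,0x2E, 0x67,0x6F,0x6F,0x67, 0x6C,0x65,0x2E,0x63
    };
    // rawData has no terminating '\0', so the length must be passed explicitly.
    string hdr(rawData, sizeof rawData);
    // find() returns string::npos (not -1) when the pattern is absent.
    string::size_type newLinePos = hdr.find("\r\n");
    cout << newLinePos;
}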
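For what it's worth, once the string really does contain the 0x0D 0x0A pairs, stepping through the header lines could look something like the sketch below. split_header_lines is a name I made up, and stopping at the first empty line is my assumption about where the header block ends.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a raw header block into lines,
// treating "\r\n" as the line terminator.
std::vector<std::string> split_header_lines(const std::string& hdr)
{
    std::vector<std::string> lines;
    std::size_t start = 0;
    while (start < hdr.size()) {
        // find() returns std::string::npos (not -1) when nothing is found.
        const std::size_t end = hdr.find("\r\n", start);
        if (end == std::string::npos) {
            lines.push_back(hdr.substr(start)); // trailing data without CRLF
            break;
        }
        if (end == start) {
            break; // empty line: assumed end of the header block
        }
        lines.push_back(hdr.substr(start, end - start));
        start = end + 2; // skip past "\r\n"
    }
    return lines;
}
```

Comparing against std::string::npos rather than -1 keeps the check independent of the integer type the result is stored in.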
enhzflep
  • I took a look here at the file in hex and each line terminates with a combination of 0a and 0d (I'm working on a Linux machine). – amoeba Oct 23 '13 at 12:42
  • @amoeba - Thanks for the extra data, hopefully it will make it easier to diagnose. I'm still missing the problem, as you are. I've added a new snippet based on your data (which I suspect won't help, to be honest). – enhzflep Oct 23 '13 at 14:55