I found a URL-parsing regular expression in RFC 2396 and RFC 3986.

^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
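
For reference, the expression can be exercised on its own before translating it. The following is only a minimal sketch using std::regex; the names UriParts and split_uri are made up for this illustration and are not part of any library. Capture groups 2, 4, 5, 7 and 9 carry the scheme, authority, path, query and fragment.

```cpp
#include <regex>
#include <string>

// Hypothetical holder for the five components (illustration only).
struct UriParts {
  std::string scheme, authority, path, query, fragment;
};

// Applies the regular expression quoted above and copies out the
// capture groups that carry the components (2, 4, 5, 7 and 9).
inline UriParts split_uri(const std::string& uri) {
  static const std::regex re(
      R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
  UriParts out;
  std::smatch m;
  if (std::regex_match(uri, m, re)) {
    out.scheme    = m[2].str();
    out.authority = m[4].str();
    out.path      = m[5].str();
    out.query     = m[7].str();
    out.fragment  = m[9].str();
  }
  return out;
}
```

With this sketch, split_uri("/pub/ietf/uri") should leave scheme and authority empty and put the whole input in path, which is the behaviour the Ragel version is expected to reproduce.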

I converted it to Ragel:

%%{    
  # RFC 3986 URI Generic Syntax (January 2005)
  machine url_parser;

  action pchar     {
    printf("%c", fc);
  }
  action scheme            { printf("scheme\n"); }
  action scheme_end        { printf("\nscheme_end\n"); }
  action authority         { printf("authority\n"); }
  action authority_end     { printf("\nauthority_end\n"); }
  action path              { printf("path\n"); }
  action path_end          { printf("\npath_end\n"); }
  action query             { printf("query\n"); }
  action query_end         { printf("\nquery_end\n"); }
  action fragment          { printf("fragment\n"); }
  action fragment_end      { printf("\nfragment_end\n"); }

  scheme    = (any - [:/?#])+ >scheme    $pchar %scheme_end ;
  authority = (any - [/?#])*  >authority $pchar %authority_end ;
  path      = (any - [?#])*   >path      $pchar %path_end ;
  query     = (any - [#])*    >query     $pchar %query_end ;
  fragment  = (any)*          >fragment  $pchar %fragment_end ; 
  main     := (( scheme ":" )?) <: (( "//" authority )?) <: path ( "?" query )? ( "#" fragment )?;
}%%

#include <cstdio>
#include <cstdlib>
#include <string>

/** Data **/
%% write data;

int main(int argc, char **argv) {
  std::string str(argv[1]);
  char const* p = str.c_str();
  char const* pe = p + str.size();
  char const* eof = pe;
  int cs = 0;

  %% write init;
  %% write exec;

  return p - str.c_str();
}

It works when I input an absolute URI:

liangxu@dev64:~$ ./uri_test "http://www.ics.uci.edu/pub/ietf/uri/?c=www&rot=1&e=%20%20"
scheme
http
scheme_end
authority
www.ics.uci.edu
authority_end
path
/pub/ietf/uri/
path_end
query
c=www&rot=1&e=%20%20
query_end

It also succeeds when I input only an authority and a path:

liangxu@dev64:~$ ./uri_test "//www.ics.uci.edu/pub/ietf/uri/?c=www&rot=1&e=%20%20"
authority
www.ics.uci.edu
authority_end
path
/pub/ietf/uri/
path_end
query
c=www&rot=1&e=%20%20
query_end

But it fails, printing nothing, when I input only a path:

liangxu@dev64:~$ ./uri_test "/pub/ietf/uri"

What's wrong?

lxu4net
The trouble seems to be in the "authority" section; if I remove that, it can find the path. Now, as to why the authority section is not working... – teambob Jul 19 '12 at 22:39

2 Answers

You are using the wrong guard operator, <: . As soon as the machine sees your first / , control is given to the authority section, so the path is never tried.

This becomes clear if you look at what <: is shorthand for:

expr $(unique_name,1) . expr >(unique_name,0)

This means that every transition that matches in the left expression is given the HIGHER priority, so whenever both sides can consume the same character, the left expression wins and the right expression is not entered.

It is much easier if you convert the ABNF notation to Ragel.
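
To illustrate the point about the first / , here is a rough hand-written splitter in C++ (not Ragel, and not the asker's code) that only commits to the authority branch after seeing both slashes. The names Parts and split_by_hand are invented for this sketch; it mirrors the component order of the grammar in the question and does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
struct Parts {
  std::string scheme, authority, path, query, fragment;
};

inline Parts split_by_hand(const std::string& s) {
  Parts out;
  const std::size_t npos = std::string::npos;
  std::size_t i = 0;

  // scheme: non-empty run of characters other than ":/?#", then ':'
  std::size_t k = s.find_first_of(":/?#");
  if (k != npos && k > 0 && s[k] == ':') {
    out.scheme = s.substr(0, k);
    i = k + 1;
  }

  // authority: entered only when BOTH slashes are present, so a
  // single leading '/' falls through to the path branch.
  if (s.compare(i, 2, "//") == 0) {
    std::size_t end = s.find_first_of("/?#", i + 2);
    if (end == npos) end = s.size();
    out.authority = s.substr(i + 2, end - (i + 2));
    i = end;
  }

  // path: everything up to '?' or '#'
  std::size_t end = s.find_first_of("?#", i);
  if (end == npos) end = s.size();
  out.path = s.substr(i, end - i);
  i = end;

  // query: after '?', up to '#'
  if (i < s.size() && s[i] == '?') {
    end = s.find('#', i + 1);
    if (end == npos) end = s.size();
    out.query = s.substr(i + 1, end - (i + 1));
    i = end;
  }

  // fragment: everything after '#'
  if (i < s.size() && s[i] == '#') out.fragment = s.substr(i + 1);
  return out;
}
```

The two-character check before the authority branch is the lookahead that the prioritized Ragel machine loses: with <: the machine commits to "//" on the first slash and then has nowhere to go when the next character is not a second slash.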

alecxe
bdnt

I did the same thing myself recently; you can have a look at my Ragel grammar: https://github.com/maximecaron/ragel-url-parser

skyde