Easy way to parse a url in C++ cross platform?

Question

I need to parse a URL to get the protocol, host, path, and query in an application I am writing in C++. The application is intended to be cross-platform. I'm surprised I can't find anything that does this in the Boost or POCO libraries. Is it somewhere obvious I'm not looking? Any suggestions on appropriate open source libs? Or is this something I just have to do myself? It's not super complicated, but it seems like such a common task that I am surprised there isn't a common solution.

C++ (and even more so C) isn't like some other languages. It's not the sort of thing where standard libraries just exist by default for everything under the sun. There might be some library in common usage, but from the perspective of standard libraries, language features, even OS-specific APIs like POSIX, it's assumed that you can do a lot yourself. — asveikau, Apr 11 '10 at 04:09
I'm happy to build a wheel but don't see the point in building it if someone else has done it. Hence my question. You're right, "There might be some library in common usage" - that's what I was asking. — Andrew Bucknell, Apr 11 '10 at 06:52
It's the sort of small utility you'd find in the big framework your codebase relies on. If it isn't there then it's a fun exercise in standard algorithms to write a small URL utility collection. — wilhelmtell, Apr 11 '10 at 07:02
To parse URLs using the `RFC 3986` standard, simply and without importing any new libraries, check out this answer to a related question: https://stackoverflow.com/a/31613265/1043704 — Lorien Brune, Sep 30 '17 at 01:42

score 33 · Accepted Answer · edited Jan 17 '14 at 17:41

There is a library that has been proposed for inclusion in Boost and lets you parse HTTP URIs easily. It uses Boost.Spirit and is released under the Boost Software License. The library is cpp-netlib; its documentation is at http://cpp-netlib.github.com/ and the latest release can be downloaded from http://github.com/cpp-netlib/cpp-netlib/downloads .

The relevant type you'll want to use is boost::network::http::uri and is documented here.

edited Jan 17 '14 at 17:41 by g3rv4 · answered Apr 11 '10 at 09:56 by Dean Michael

Add note for datedness reasons: This library no longer compiles as intended as a whole in 2020 with the latest versions of Boost, thanks to Boost deprecating get_io_service. However, you can still yank the relevant functionality from the carcass, as it's self-contained and doesn't depend on those parts of the library. – The_Sympathizer Nov 25 '20 at 01:54

Tom · Answer 2 · 2012-06-15T04:00:36.087

Wstring version of above, added other fields I needed. Could definitely be refined, but good enough for my purposes.

#include <algorithm>    // std::find, std::equal
#include <string>

struct Uri
{
    std::wstring QueryString, Path, Protocol, Host, Port;

    static Uri Parse(const std::wstring &uri)
    {
        Uri result;

        typedef std::wstring::const_iterator iterator_t;

        if (uri.empty())
            return result;

        const iterator_t uriEnd = uri.end();

        // The query starts at the first '?'; protocol, host, port and path
        // are only searched for in the part before it.
        const iterator_t queryStart = std::find(uri.begin(), uriEnd, L'?');

        // protocol: recognized only when the first ':' begins "://" and
        // something follows the separator
        iterator_t hostStart = uri.begin();
        const iterator_t protocolEnd = std::find(uri.begin(), queryStart, L':');
        const std::wstring separator = L"://";
        if ((protocolEnd != queryStart) && (uriEnd - protocolEnd > 3)
            && std::equal(separator.begin(), separator.end(), protocolEnd))
        {
            result.Protocol = std::wstring(uri.begin(), protocolEnd);
            hostStart = protocolEnd + 3;    // skip "://"
        }

        // host: ends at the first ':' (port) or '/' (path) before the query
        const iterator_t pathStart = std::find(hostStart, queryStart, L'/');
        const iterator_t hostEnd = std::find(hostStart, pathStart, L':');
        result.Host = std::wstring(hostStart, hostEnd);

        // port: present only when hostEnd stopped at a ':'
        if (hostEnd != pathStart)
            result.Port = std::wstring(hostEnd + 1, pathStart);

        // path: empty when there is no '/' before the query
        result.Path = std::wstring(pathStart, queryStart);

        // query: includes the leading '?'
        result.QueryString = std::wstring(queryStart, uriEnd);

        return result;
    }   // Parse
};  // Uri

Tests/Usage

Uri u0 = Uri::Parse(L"http://localhost:80/foo.html?&q=1:2:3");
Uri u1 = Uri::Parse(L"https://localhost:80/foo.html?&q=1");
Uri u2 = Uri::Parse(L"localhost/foo");
Uri u3 = Uri::Parse(L"https://localhost/foo");
Uri u4 = Uri::Parse(L"localhost:8080");
Uri u5 = Uri::Parse(L"localhost?&foo=1");
Uri u6 = Uri::Parse(L"localhost?&foo=1:2:3");

u0.QueryString, u0.Path, u0.Protocol, u0.Host, u0.Port....

One thing I found when crawling the internet is that URLs found in the real world are commonly broken or malformed (yet most browsers still understand them correctly). The biggest offender is the query: it should start with `?`, but in the real world it more often starts with `&`. — Martin York, Jun 30 '20 at 17:47
What does your code return on `ftp://user:passwd@example.com:1555/docs/Java&C++`? — Aleksey F., Nov 19 '21 at 03:04
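The URL in the comment above has a `user:passwd@` part in front of the host, so a parser that treats the first ':' after the scheme as the host/port separator will get both host and port wrong. Below is a minimal sketch of one way to split that part, assuming the authority (the text between "://" and the first '/', '?' or '#') has already been isolated. `Authority` and `split_authority` are illustrative names that do not come from any answer here, and bracketed IPv6 literals are not handled.

```cpp
#include <string>

// Hypothetical helper type: the pieces of "userinfo@host:port".
struct Authority {
    std::string userinfo, host, port;
};

// Splits an already-isolated authority component.
// Assumption: no bracketed IPv6 literal such as "[::1]:80".
Authority split_authority(const std::string& authority)
{
    Authority out;
    std::string hostport = authority;

    // Userinfo ends at the last '@'; anything before it may contain ':'.
    const std::string::size_type at = authority.rfind('@');
    if (at != std::string::npos) {
        out.userinfo = authority.substr(0, at);
        hostport = authority.substr(at + 1);
    }

    // Only a ':' after the userinfo separates host from port.
    const std::string::size_type colon = hostport.find(':');
    out.host = hostport.substr(0, colon);
    if (colon != std::string::npos)
        out.port = hostport.substr(colon + 1);
    return out;
}
```

For the authority of the URL in the comment, `split_authority("user:passwd@example.com:1555")` yields the userinfo `user:passwd`, the host `example.com` and the port `1555`.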

wilhelmtell · Answer 3 · 2010-04-11T06:49:04.643

Terribly sorry, couldn't help it. :s

url.hh

#ifndef URL_HH_
#define URL_HH_    
#include <string>
struct url {
    url(const std::string& url_s); // omitted copy, ==, accessors, ...
private:
    void parse(const std::string& url_s);
private:
    std::string protocol_, host_, path_, query_;
};
#endif /* URL_HH_ */

url.cc

#include "url.hh"
#include <string>
#include <algorithm>
#include <cctype>
#include <iterator>   // back_inserter, distance, advance
using namespace std;

// ctors, copy, equality, ...

namespace {
// Lowercases one character. The original passed ptr_fun<int,int>(tolower),
// which was removed in C++17; the cast keeps tolower well defined for
// negative char values.
char to_lower(char c)
{
    return static_cast<char>(tolower(static_cast<unsigned char>(c)));
}
}

void url::parse(const string& url_s)
{
    const string prot_end("://");
    string::const_iterator prot_i = search(url_s.begin(), url_s.end(),
                                           prot_end.begin(), prot_end.end());
    protocol_.reserve(distance(url_s.begin(), prot_i));
    transform(url_s.begin(), prot_i,
              back_inserter(protocol_),
              to_lower); // protocol is icase
    if( prot_i == url_s.end() )
        return;
    advance(prot_i, prot_end.length());
    string::const_iterator path_i = find(prot_i, url_s.end(), '/');
    host_.reserve(distance(prot_i, path_i));
    transform(prot_i, path_i,
              back_inserter(host_),
              to_lower); // host is icase
    string::const_iterator query_i = find(path_i, url_s.end(), '?');
    path_.assign(path_i, query_i);
    if( query_i != url_s.end() )
        ++query_i; // skip the '?'
    query_.assign(query_i, url_s.end());
}

main.cc

// ...
    url u("HTTP://stackoverflow.com/questions/2616011/parse-a.py?url=1");
    cout << u.protocol() << '\t' << u.host() << ...

edited Apr 11 '10 at 06:49 · answered Apr 11 '10 at 06:17 by wilhelmtell

Minor nitpick: You don't need to use ptr_fun here, and if you do, you need to `#include <functional>`. (You probably shouldn't use `using namespace std` either, but I'm assuming this isn't for production code.) – Billy ONeal Apr 11 '10 at 06:27
I omitted some trivial functionality, like the assignment operator, constructors, accessors and so on. The `url` class shouldn't have mutators. For the equality operator, you might add a hash member that you fill in while parsing the original string. Then, comparing two urls for equality should be very fast. It also means some extra complexity; it's your call. – wilhelmtell Apr 11 '10 at 07:07
@Billy I always bring namespace `std` into my compilation units (not the headers!). I think it's perfectly fine, and I think that having `std::` all over the place poses more pollution and eye-fatigue than bringing in the namespace. – wilhelmtell Apr 11 '10 at 07:12
Funny how things are: on the contrary, I agree with Billy ONeal and remove every `using namespace` I come across. If you really repeat a symbol, you can always have `using std::string;`, but I prefer namespace qualification; it makes it easier for poor old me to understand where that symbol came from. – Matthieu M. Apr 11 '10 at 11:45
You also don't account for the example.com:port/pathname syntax. – Chris K Jul 14 '10 at 01:33
There are a lot of URI/URL forms not supported besides example.com:port/pathname. For instance http:/pathname and, more importantly, http://username:password@example.com/pathname#section. All the combinations are listed in http://www.ietf.org/rfc/rfc2396.txt, which shows the following regex: ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? – jdkoftinoff Oct 18 '10 at 02:52
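The regex quoted in the last comment can be used directly with std::regex (C++11). The sketch below is only an illustration of that idea: `UriParts` and `split_uri` are made-up names, and the regex splits a URI reference into its generic components without validating them or breaking the authority down into user, host and port.

```cpp
#include <regex>
#include <string>

// Illustrative result type; the field names are not from any answer above.
struct UriParts {
    std::string scheme, authority, path, query, fragment;
};

// Applies the generic regex quoted in the comment above.
// It separates the components but does not validate any of them.
bool split_uri(const std::string& uri, UriParts& out)
{
    static const std::regex re(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
    std::smatch m;
    if (!std::regex_match(uri, m, re))
        return false;
    out.scheme    = m[2].str();  // group 2: scheme without the ':'
    out.authority = m[4].str();  // group 4: authority without the "//"
    out.path      = m[5].str();  // group 5: path
    out.query     = m[7].str();  // group 7: query without the '?'
    out.fragment  = m[9].str();  // group 9: fragment without the '#'
    return true;
}
```

Note that an input without "//", such as example.com:port/pathname, comes back with `example.com` as the scheme and `port/pathname` as the path, so scheme-less host:port strings still need special handling.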

score 14 · Answer 4 · answered Aug 16 '13 at 20:09

POCO's URI class can parse URLs for you. The following example is a shortened version of the one in the POCO URI and UUID slides:

#include "Poco/URI.h"
#include <iostream>

int main(int argc, char** argv)
{
    Poco::URI uri1("http://www.appinf.com:88/sample?example-query#frag");

    std::string scheme(uri1.getScheme()); // "http"
    std::string auth(uri1.getAuthority()); // "www.appinf.com:88"
    std::string host(uri1.getHost()); // "www.appinf.com"
    unsigned short port = uri1.getPort(); // 88
    std::string path(uri1.getPath()); // "/sample"
    std::string query(uri1.getQuery()); // "example-query"
    std::string frag(uri1.getFragment()); // "frag"
    std::string pathEtc(uri1.getPathEtc()); // "/sample?example-query#frag"

    return 0;
}

watch out, not sure why when you pass something like "127.0.0.1:443" it won't work as expected... — Gelldur, Oct 24 '22 at 07:15
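An input like "127.0.0.1:443" has no "scheme://" prefix, so a generic URI parser cannot reliably tell that the text before the ':' is a host rather than a scheme. One possible workaround, sketched here under the assumption that such inputs are meant as host[:port], is to prepend a default scheme before handing the string to the parser. `with_default_scheme` is a hypothetical helper, not part of POCO.

```cpp
#include <string>

// Hypothetical helper: returns the input unchanged if it already starts with
// a "scheme://" prefix, otherwise prepends default_scheme + "://".
std::string with_default_scheme(const std::string& input,
                                const std::string& default_scheme = "http")
{
    const std::string::size_type sep = input.find("://");
    const std::string::size_type delim = input.find_first_of("/?#");

    // "://" only counts as a scheme separator if nothing that starts a
    // path, query or fragment appears before it.
    if (sep != std::string::npos && sep < delim)
        return input;
    return default_scheme + "://" + input;
}
```

With this, `with_default_scheme("127.0.0.1:443")` gives `http://127.0.0.1:443`, while a string that already carries a scheme is passed through untouched.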

score 13 · Answer 5 · edited Sep 06 '21 at 22:04

For completeness, there is one written in C that you could use (with a little wrapping, no doubt): https://uriparser.github.io/

[RFC-compliant and supports Unicode]

Here's a very basic wrapper I've been using for simply grabbing the results of a parse.

#include <string>
#include <uriparser/Uri.h>


namespace uriparser
{
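    class Uri
    {
        public:
            explicit Uri(std::string uri)
                : uri_(uri)
            {
                UriParserStateA state_;
                state_.uri = &uriParse_;
                isValid_   = uriParseUriA(&state_, uri_.c_str()) == URI_SUCCESS;
            }

            ~Uri() { uriFreeUriMembersA(&uriParse_); }

            // uriParse_ refers to the text in uri_ and is freed in the
            // destructor, so copying is disabled to avoid a double free.
            Uri(const Uri &) = delete;
            Uri & operator=(const Uri &) = delete;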

            bool isValid() const { return isValid_; }

            std::string scheme()   const { return fromRange(uriParse_.scheme); }
            std::string host()     const { return fromRange(uriParse_.hostText); }
            std::string port()     const { return fromRange(uriParse_.portText); }
            std::string path()     const { return fromList(uriParse_.pathHead, "/"); }
            std::string query()    const { return fromRange(uriParse_.query); }
            std::string fragment() const { return fromRange(uriParse_.fragment); }

        private:
            std::string uri_;
            UriUriA     uriParse_;
            bool        isValid_;

            std::string fromRange(const UriTextRangeA & rng) const
            {
                return std::string(rng.first, rng.afterLast);
            }

            std::string fromList(UriPathSegmentA * xs, const std::string & delim) const
            {
                UriPathSegmentStructA * head(xs);
                std::string accum;

                while (head)
                {
                    accum += delim + fromRange(head->text);
                    head = head->next;
                }

                return accum;
            }
    };
}

+1, I ended up cloning your URL parser lib off github. Much nicer not having to pull in all of boost... — Alan, Oct 21 '13 at 03:43
@Alan I don't know of a URL parser in Boost. cpp-netlib has one, but I've had issues with it (very possibly fixed by now) so I use this one instead. — Elliot Cameron, Apr 15 '15 at 17:36

velcrow · Answer 6 · 2015-01-07T23:56:06.847

//sudo apt-get install libboost-all-dev; #install boost
//g++ urlregex.cpp -lboost_regex; #compile
#include <string>
#include <iostream>
#include <boost/regex.hpp>

using namespace std;

int main(int argc, char* argv[])
{
    string url="https://www.google.com:443/webhp?gws_rd=ssl#q=cpp";
    boost::regex ex("(http|https)://([^/ :]+):?([^/ ]*)(/?[^ #?]*)\\x3f?([^ #]*)#?([^ ]*)");
    boost::cmatch what;
    if(regex_match(url.c_str(), what, ex)) 
    {
        cout << "protocol: " << string(what[1].first, what[1].second) << endl;
        cout << "domain:   " << string(what[2].first, what[2].second) << endl;
        cout << "port:     " << string(what[3].first, what[3].second) << endl;
        cout << "path:     " << string(what[4].first, what[4].second) << endl;
        cout << "query:    " << string(what[5].first, what[5].second) << endl;
        cout << "fragment: " << string(what[6].first, what[6].second) << endl;
    }
    return 0;
}

score 7 · Answer 7 · edited Dec 14 '18 at 20:44

The Poco library now has a class for dissecting URIs and returning the host, path segments, query string, etc.

https://pocoproject.org/pro/docs/Poco.URI.html

edited Dec 14 '18 at 20:44 by yan · answered Sep 01 '11 at 07:44 by Tom Makin

score 6 · Answer 8 · answered May 21 '15 at 23:41

Facebook's Folly library can do the job for you easily. Simply use the Uri class:

#include <folly/Uri.h>

int main() {
    folly::Uri uri("https://code.facebook.com/posts/177011135812493/");

    uri.scheme(); // https
    uri.host();   // code.facebook.com
    uri.path();   // /posts/177011135812493/
}

Matthew Flaschen · Answer 9 · 2010-04-11T04:34:41.100

Qt has QUrl for this. GNOME has SoupURI in libsoup, which you'll probably find a little more lightweight.

edited Apr 11 '10 at 04:34 · answered Apr 11 '10 at 04:23 by Matthew Flaschen

score 4 · Answer 10 · answered Apr 14 '20 at 18:23

I know this is a very old question, but I've found the following useful:

http://www.zedwood.com/article/cpp-boost-url-regex

It gives 3 examples:

(With Boost)

//sudo apt-get install libboost-all-dev;
//g++ urlregex.cpp -lboost_regex
#include <string>
#include <iostream>
#include <boost/regex.hpp>

using std::string;
using std::cout;
using std::endl;
using std::stringstream;

void parse_url(const string& url) //with boost
{
    boost::regex ex("(http|https)://([^/ :]+):?([^/ ]*)(/?[^ #?]*)\\x3f?([^ #]*)#?([^ ]*)");
    boost::cmatch what;
    if(regex_match(url.c_str(), what, ex)) 
    {
        string protocol = string(what[1].first, what[1].second);
        string domain   = string(what[2].first, what[2].second);
        string port     = string(what[3].first, what[3].second);
        string path     = string(what[4].first, what[4].second);
        string query    = string(what[5].first, what[5].second);
        cout << "[" << url << "]" << endl;
        cout << protocol << endl;
        cout << domain << endl;
        cout << port << endl;
        cout << path << endl;
        cout << query << endl;
        cout << "-------------------------------" << endl;
    }
}

int main(int argc, char* argv[])
{
    parse_url("http://www.google.com");
    parse_url("https://mail.google.com/mail/");
    parse_url("https://www.google.com:443/webhp?gws_rd=ssl");
    return 0;
}

(Without Boost)

#include <string>
#include <iostream>

using std::string;
using std::cout;
using std::endl;
using std::stringstream;

string _trim(const string& str)
{
    size_t start = str.find_first_not_of(" \n\r\t");
    size_t until = str.find_last_not_of(" \n\r\t");
    string::const_iterator i = start==string::npos ? str.begin() : str.begin() + start;
    string::const_iterator x = until==string::npos ? str.end()   : str.begin() + until+1;
    return string(i,x);
}

void parse_url(const string& raw_url) //no boost
{
    string path,domain,x,protocol,port,query;
    int offset = 0;
    size_t pos1,pos2,pos3,pos4;
    x = _trim(raw_url);
    offset = offset==0 && x.compare(0, 8, "https://")==0 ? 8 : offset;
    offset = offset==0 && x.compare(0, 7, "http://" )==0 ? 7 : offset;
    pos1 = x.find_first_of('/', offset+1 );
    path = pos1==string::npos ? "" : x.substr(pos1);
    domain = string( x.begin()+offset, pos1 != string::npos ? x.begin()+pos1 : x.end() );
    path = (pos2 = path.find("#"))!=string::npos ? path.substr(0,pos2) : path;
    port = (pos3 = domain.find(":"))!=string::npos ? domain.substr(pos3+1) : "";
    domain = domain.substr(0, pos3!=string::npos ? pos3 : domain.length());
    protocol = offset > 0 ? x.substr(0,offset-3) : "";
    query = (pos4 = path.find("?"))!=string::npos ? path.substr(pos4+1) : "";
    path = pos4!=string::npos ? path.substr(0,pos4) : path;
    cout << "[" << raw_url << "]" << endl;
    cout << "protocol: " << protocol << endl;
    cout << "domain: " << domain << endl;
    cout << "port: " << port << endl;
    cout << "path: " << path << endl;
    cout << "query: " << query << endl;
}

int main(int argc, char* argv[])
{
    parse_url("http://www.google.com");
    parse_url("https://mail.google.com/mail/");
    parse_url("https://www.google.com:443/webhp?gws_rd=ssl");
    return 0;
}

(Different way without Boost)

#include <string>
#include <stdint.h>
#include <cstring>
#include <sstream>
#include <algorithm>

#include <iostream> 
using std::cerr; using std::cout; using std::endl;

using std::string;

class HTTPURL
{
    private:
        string _protocol;// http vs https
        string _domain;  // mail.google.com
        uint16_t _port;  // 80,443
        string _path;    // /mail/
        string _query;   // [after ?] a=b&c=b

    public:
        const string &protocol;
        const string &domain;
        const uint16_t &port;
        const string &path;
        const string &query;

        HTTPURL(const string& url): protocol(_protocol),domain(_domain),port(_port),path(_path),query(_query)
        {
            string u = _trim(url);
            size_t offset=0, slash_pos, hash_pos, colon_pos, qmark_pos;
            string urlpath,urldomain,urlport;
            uint16_t default_port;

            static const char* allowed[] = { "https://", "http://", "ftp://", NULL};
            for(int i=0; allowed[i]!=NULL && this->_protocol.length()==0; i++)
            {
                const char* c=allowed[i];
                if (u.compare(0,strlen(c), c)==0) {
                    offset = strlen(c);
                    this->_protocol=string(c,0,offset-3);
                }
            }
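            // ftp:// is in the allowed list above, so give it its own default port
            default_port = this->_protocol=="https" ? 443 : this->_protocol=="ftp" ? 21 : 80;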
            slash_pos = u.find_first_of('/', offset+1 );
            urlpath = slash_pos==string::npos ? "/" : u.substr(slash_pos);
            urldomain = string( u.begin()+offset, slash_pos != string::npos ? u.begin()+slash_pos : u.end() );
            urlpath = (hash_pos = urlpath.find("#"))!=string::npos ? urlpath.substr(0,hash_pos) : urlpath;
            urlport = (colon_pos = urldomain.find(":"))!=string::npos ? urldomain.substr(colon_pos+1) : "";
            urldomain = urldomain.substr(0, colon_pos!=string::npos ? colon_pos : urldomain.length());
            this->_domain = _tolower(urldomain);
            this->_query = (qmark_pos = urlpath.find("?"))!=string::npos ? urlpath.substr(qmark_pos+1) : "";
            this->_path = qmark_pos!=string::npos ? urlpath.substr(0,qmark_pos) : urlpath;
            this->_port = urlport.length()==0 ? default_port : _atoi(urlport) ;
        };
    private:
        static inline string _trim(const string& input)
        {
            string str = input;
            size_t endpos = str.find_last_not_of(" \t\n\r");
            if( string::npos != endpos )
            {
                str = str.substr( 0, endpos+1 );
            }
            size_t startpos = str.find_first_not_of(" \t\n\r");
            if( string::npos != startpos )
            {
                str = str.substr( startpos );
            }
            return str;
        };
        static inline string _tolower(const string& input)
        {
            string str = input;
            std::transform(str.begin(), str.end(), str.begin(), ::tolower);
            return str;
        };
        static inline int _atoi(const string& input) 
        {
            int r = 0;  // stays 0 if the port text is not a number
            std::stringstream(input) >> r;
            return r;
        };
};

int main(int argc, char **argv)
{
    HTTPURL u("https://Mail.google.com:80/mail/?action=send#action=send");
    cout << "protocol: " << u.protocol << endl;
    cout << "domain: " << u.domain << endl;
    cout << "port: " << u.port << endl;
    cout << "path: " << u.path << endl;
    cout << "query: " << u.query << endl;
    return 0;
}

score 3 · Answer 11 · answered Feb 04 '15 at 16:29

This library is very tiny and lightweight: https://github.com/corporateshark/LUrlParser

However, it is parsing only, no URL normalization/validation.

answered Feb 04 '15 at 16:29 by Sergey K.

score 2 · Answer 12 · edited May 23 '17 at 12:26

Also of interest could be http://code.google.com/p/uri-grammar/ which, like Dean Michael's cpp-netlib, uses Boost.Spirit to parse a URI. I came across it at "Simple expression parser example using Boost::Spirit?".

edited May 23 '17 at 12:26 by Community · answered Nov 18 '10 at 15:47 by Ralf

Mike Ellery · Answer 13 · 2011-03-23T17:18:09.690

There is the newly released google-url lib:

http://code.google.com/p/google-url/

The library provides a low-level url parsing API as well as a higher-level abstraction called GURL. Here's an example using that:

#include <cassert>
#include <googleurl/src/gurl.h>   // forward slashes work on every platform

wchar_t url[] = L"http://www.facebook.com";
GURL parsedUrl (url);
assert(parsedUrl.DomainIs("facebook.com"));

Two small complaints I have with it: (1) it wants to use ICU by default to deal with different string encodings and (2) it makes some assumptions about logging (but I think they can be disabled). In other words, the library is not completely stand-alone as it exists, but I think it's still a good basis to start with, especially if you are already using ICU.

It's merged with the Chromium source and no longer maintained separately. — Silver Moon, May 13 '15 at 05:33

score 2 · Answer 14 · answered Nov 28 '18 at 18:47

May I offer another self-contained solution based on std::regex :

const char* SCHEME_REGEX   = "((http[s]?)://)?";  // match http or https before the ://
const char* USER_REGEX     = "(([^@/:\\s]+)@)?";  // match anything other than @ / : or whitespace before the ending @
const char* HOST_REGEX     = "([^@/:\\s]+)";      // mandatory. match anything other than @ / : or whitespace
const char* PORT_REGEX     = "(:([0-9]{1,5}))?";  // after the : match 1 to 5 digits
const char* PATH_REGEX     = "(/[^:#?\\s]*)?";    // after the / match anything other than : # ? or whitespace
const char* QUERY_REGEX    = "(\\?(([^?;&#=]+=[^?;&#=]+)([;&]([^?;&#=]+=[^?;&#=]+))*))?"; // after the ? match any number of x=y pairs, separated by & or ;
const char* FRAGMENT_REGEX = "(#([^#\\s]*))?";    // after the # match anything other than # or whitespace

bool parseUri(const std::string &i_uri)
{
    static const std::regex regExpr(std::string("^")
        + SCHEME_REGEX + USER_REGEX
        + HOST_REGEX + PORT_REGEX
        + PATH_REGEX + QUERY_REGEX
        + FRAGMENT_REGEX + "$");

    std::smatch matchResults;
    if (std::regex_match(i_uri.cbegin(), i_uri.cend(), matchResults, regExpr))
    {
        m_scheme.assign(matchResults[2].first, matchResults[2].second);
        m_user.assign(matchResults[4].first, matchResults[4].second);
        m_host.assign(matchResults[5].first, matchResults[5].second);
        m_port.assign(matchResults[7].first, matchResults[7].second);
        m_path.assign(matchResults[8].first, matchResults[8].second);
        m_query.assign(matchResults[10].first, matchResults[10].second);
        m_fragment.assign(matchResults[15].first, matchResults[15].second);

        return true;
    }

    return false;
}

I added explanations for each part of the regular expression. This way allows you to choose exactly the relevant parts to parse for the URL that you're expecting to get. Just remember to change the desired regular expression group indices accordingly.
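One way to make the index bookkeeping less fragile is to mark the wrapper groups as non-capturing with `(?:...)`, which std::regex's default ECMAScript grammar supports, so that each capturing group corresponds to exactly one component. The sketch below uses a deliberately simplified pattern (scheme, host, port and path only) rather than the full expression above; `SimpleUri` and `parse_simple` are illustrative names.

```cpp
#include <regex>
#include <string>

// Illustrative type, not taken from the answer above.
struct SimpleUri {
    std::string scheme, host, port, path;
};

bool parse_simple(const std::string& uri, SimpleUri& out)
{
    // (?:...) groups do not consume a capture index, so:
    //   1 = scheme, 2 = host, 3 = port, 4 = path
    static const std::regex re(
        R"(^(?:(https?)://)?([^@/:\s]+)(?::([0-9]{1,5}))?(/[^:#?\s]*)?$)");
    std::smatch m;
    if (!std::regex_match(uri, m, re))
        return false;
    out.scheme = m[1].str();  // empty when the optional group did not match
    out.host   = m[2].str();
    out.port   = m[3].str();
    out.path   = m[4].str();
    return true;
}
```

Dropping or adding an optional part then shifts only the indices that come after it, instead of two indices per part.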

score 2 · Answer 15 · answered Mar 20 '19 at 17:22

A small dependency you can use is uriparser, which recently moved to GitHub.

You can find a minimal example in their code: https://github.com/uriparser/uriparser/blob/63384be4fb8197264c55ff53a135110ecd5bd8c4/tool/uriparse.c

This will be more lightweight than Boost or Poco. The only catch is that it is C.

There is also a Buckaroo package:

buckaroo add github.com/buckaroo-pm/uriparser

score 2 · Answer 16 · answered May 19 '20 at 05:17

I tried a couple of the solutions here, but then decided to write my own that could just be dropped into a project without any external dependencies (it only requires C++17).

Right now, it passes all tests. But, if you find any cases that don't succeed, please feel free to create a Pull Request or an Issue.

I'll keep it up to date and improve its quality. Suggestions welcome! I'm also trying out this design to only have a single, high-quality class per repository so that the header and source can just be dropped into a project (as opposed to building a library or header-only). It appears to be working out well (I'm using git submodules and symlinks in my own projects).

https://github.com/homer6/url

Software Craftsman · Answer 17 · 2016-09-30T16:10:48.253

You could try the open-source library called C++ REST SDK (created by Microsoft, distributed under the Apache License 2.0). It can be built for several platforms, including Windows, Linux, OSX, iOS, and Android. There is a class called web::uri that takes a string and lets you retrieve the individual URL components. Here is a code sample (tested on Windows):

#include <cpprest/base_uri.h>
#include <iostream>
#include <ostream>

web::uri sample_uri( L"http://dummyuser@localhost:7777/dummypath?dummyquery#dummyfragment" );
std::wcout << L"scheme: "   << sample_uri.scheme()     << std::endl;
std::wcout << L"user: "     << sample_uri.user_info()  << std::endl;
std::wcout << L"host: "     << sample_uri.host()       << std::endl;
std::wcout << L"port: "     << sample_uri.port()       << std::endl;
std::wcout << L"path: "     << sample_uri.path()       << std::endl;
std::wcout << L"query: "    << sample_uri.query()      << std::endl;
std::wcout << L"fragment: " << sample_uri.fragment()   << std::endl;

The output will be:

scheme: http
user: dummyuser
host: localhost
port: 7777
path: /dummypath
query: dummyquery
fragment: dummyfragment

There are also other easy-to-use methods, e.g. to access individual attribute/value pairs from the query, split the path into components, etc.
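For readers who are not using the SDK, splitting a query string into attribute/value pairs can be sketched with the standard library alone. This is only an illustration, not how the C++ REST SDK implements it: `split_query` is a hypothetical name, pairs are assumed to be separated by '&', and no percent-decoding is performed.

```cpp
#include <string>
#include <utility>
#include <vector>

// Splits "a=1&b=2" into {("a","1"), ("b","2")}.
// Assumptions: '&' separates pairs, the first '=' separates name from value,
// and values are returned exactly as written (no percent-decoding).
std::vector<std::pair<std::string, std::string>>
split_query(const std::string& query)
{
    std::vector<std::pair<std::string, std::string>> pairs;
    std::string::size_type pos = 0;
    while (pos <= query.size()) {
        std::string::size_type amp = query.find('&', pos);
        if (amp == std::string::npos)
            amp = query.size();
        const std::string item = query.substr(pos, amp - pos);
        if (!item.empty()) {   // skips empty segments such as a leading '&'
            const std::string::size_type eq = item.find('=');
            if (eq == std::string::npos)
                pairs.emplace_back(item, std::string());
            else
                pairs.emplace_back(item.substr(0, eq), item.substr(eq + 1));
        }
        pos = amp + 1;
    }
    return pairs;
}
```

Skipping empty segments means a query that begins with a stray '&' still yields the pairs that follow it.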

Serge Rogatch · Answer 18 · 2020-06-30T17:02:37.680

If you use oatpp for web request handling, you can find its built-in URL parsing useful:

  std::string url = /* ... */;
  oatpp::String oatUrl(url.c_str(), url.size(), false);
  oatpp::String oatHost = oatpp::network::Url::Parser::parseUrl(oatUrl).authority.host->toLowerCase();
  std::string host(oatHost->c_str(), oatHost->getSize());

The above snippet retrieves the hostname. In a similar way:

oatpp::network::Url parsedUrl = oatpp::network::Url::Parser::parseUrl(oatUrl);
// parsedUrl.authority.port
// parsedUrl.path
// parsedUrl.scheme
// parsedUrl.queryParams

score 0 · Answer 19 · answered Dec 20 '17 at 10:21

There is yet another library, https://snapwebsites.org/project/libtld, which handles all possible top-level domains and URI schemes.

answered Dec 20 '17 at 10:21 by Larytet

Fabiano Tarlao · Answer 20 · 2019-12-23T08:10:39.370

I have developed an "object oriented" solution, a single C++ class, that works with one regex like the solutions by @Mr.Jones and @velcrow. My Url class performs URL/URI 'parsing'.

I think I improved velcrow's regex to be more robust, and it also includes the username part.

What follows is the first version of my idea; I have released an improved version of the same code in my GPL3-licensed open source project, Cpp URL Parser.

With the #ifndef/#define include-guard boilerplate omitted, here is Url.h:

#include <string>
#include <iostream>
#include <boost/regex.hpp>

using namespace std;

class Url {
public:
    boost::regex ex;
    string rawUrl;

    string username;
    string protocol;
    string domain;
    string port;
    string path;
    string query;
    string fragment;

    Url();

    Url(string &rawUrl);

    Url &update(string &rawUrl);
};

This is the code of the Url.cpp implementation file:

#include "Url.h"

Url::Url() {
    this -> ex = boost::regex("(ssh|sftp|ftp|smb|http|https):\\/\\/(?:([^@ ]*)@)?([^:?# ]+)(?::(\\d+))?([^?# ]*)(?:\\?([^# ]*))?(?:#([^ ]*))?");
}

Url::Url(string &rawUrl) : Url() {
    this->rawUrl = rawUrl;
    this->update(this->rawUrl);
}

Url &Url::update(string &rawUrl) {
    this->rawUrl = rawUrl;
    boost::cmatch what;
    if (regex_match(rawUrl.c_str(), what, ex)) {
        this -> protocol = string(what[1].first, what[1].second);
        this -> username = string(what[2].first, what[2].second);
        this -> domain = string(what[3].first, what[3].second);
        this -> port = string(what[4].first, what[4].second);
        this -> path = string(what[5].first, what[5].second);
        this -> query = string(what[6].first, what[6].second);
        this -> fragment = string(what[7].first, what[7].second);
    }
    return *this;
}

Usage example:

string urlString = "http://gino@ciao.it:67/ciao?roba=ciao#34";
Url url(urlString);
std::cout << " username: " << url.username << " URL domain: " << url.domain;
std::cout << " port: " << url.port << " protocol: " << url.protocol;

You can also update the Url object to represent (and parse) another URL:

// update() takes a non-const reference, so it needs a named string rather than a literal
string newUrlString = "http://gino@nuovociao.it:68/nuovociao?roba=ciaoooo#";
url.update(newUrlString);

I'm not a full-time C++ developer, so I'm not sure I followed C++ best practices 100%. Any tips are appreciated.

P.S.: take a look at Cpp URL Parser; there are refinements there.

Have fun

score 0 · Answer 21 · answered Dec 09 '20 at 08:26

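A simple solution to get the protocol, host, and path:

#include <string>

int url_get(const std::string& uri)
{
  //parse URI
  std::size_t start = uri.find("://", 0);
  if (start == std::string::npos)
  {
    return -1;
  }
  start += 3; //"://"
  std::size_t end = uri.find("/", start + 1);
  std::string protocol = uri.substr(0, start - 3);
  //when there is no path, end is npos and substr takes the rest of the string
  std::string host = uri.substr(start, end - start);
  //substr(npos) would throw std::out_of_range, so an absent path is left empty
  std::string path = (end == std::string::npos) ? std::string() : uri.substr(end);
  return 0;
}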

answered Dec 09 '20 at 08:26 by Pedro Vicente

What does your code return in the `host` from `ftp://user:passwd@example.com:1555/docs/Java&C++`? – Aleksey F. Nov 20 '21 at 01:21
