I'm debugging an existing Boost QI grammar in our software for parsing "endpoints" (host:port
, where host
can be a hostname, IPv4 address, or IPv6 address). I'm specifically having a problem with the hostname
part (the port, IPv4, and IPv6 parts all work perfectly). The full code is extensive, so I've simplified it down to the following example, which can be easily run on Wandbox (C++17, Boost 1.79.0)
#include <iostream>
#include <string>
#include <boost/fusion/include/std_pair.hpp>
#include <boost/fusion/include/boost_array.hpp>
#include <boost/phoenix.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/algorithm/string.hpp>
class Endpoint
{
public:
Endpoint(): _host(""), _port(0) {};
std::string _host;
unsigned short _port;
};
template < typename ITERATOR >
struct HostnameGrammar :
boost::spirit::qi::grammar< ITERATOR, std::string() >
{
HostnameGrammar()
:
HostnameGrammar::base_type( start )
{
using boost::spirit::qi::alnum;
using boost::spirit::qi::alpha;
using boost::spirit::qi::char_;
using boost::spirit::_1;
start %=
dottedName | singleName
;
dottedName %=
+( singleName >> char_( '.' ) ) >>
tld
;
singleName %=
+alnum >> *( char_('-') >> +alnum )
;
tld %=
+alpha >> *( -char_('-') >> +alnum )
;
}
boost::spirit::qi::rule< ITERATOR, std::string() > start;
boost::spirit::qi::rule< ITERATOR, std::string() > dottedName;
boost::spirit::qi::rule< ITERATOR, std::string() > singleName;
boost::spirit::qi::rule< ITERATOR, std::string() > tld;
};
template < typename ITERATOR >
struct EndpointGrammar :
boost::spirit::qi::grammar< ITERATOR, std::string() >
{
EndpointGrammar(
Endpoint & endpoint )
:
EndpointGrammar::base_type( start ),
endpoint_( endpoint )
{
using boost::spirit::qi::ushort_;
using boost::spirit::_1;
start =
-address[ boost::phoenix::ref( endpoint._host ) = _1 ] >>
-( ':' >> ushort_[ boost::phoenix::ref( endpoint._port ) = _1 ] )
;
address %=
hostname;
}
Endpoint & endpoint_;
boost::spirit::qi::rule< ITERATOR, std::string() > start;
boost::spirit::qi::rule< ITERATOR, std::string() > address;
HostnameGrammar< ITERATOR > hostname;
};
int main()
{
std::vector< std::string > endpointStrings {
// these should parse successfully (and do)
"0foo", "foo", "foo.net", "foo:1234", "foo.net:5678", "foo.example.net", "foo.example.net:9012",
"foo-bar", "foo-bar.com", "foo-bar:1234", "foo-bar.net-0:5678", "foo-bar.example.com-1:9012",
// these should fail to parse (and do)
"foo.0bar", "foo.0bar:1234", "foo.bar-", "foo.bar-:1234", "foo-", "-foo"
};
for ( auto const & endpointString : endpointStrings )
{
Endpoint tempEndpoint;
std::string::const_iterator beginIt( endpointString.begin() );
std::string::const_iterator endIt( endpointString.end() );
EndpointGrammar< std::string::const_iterator > grammar( tempEndpoint );
if ( !boost::spirit::qi::parse( beginIt, endIt, grammar ) || beginIt != endIt )
{
std::cout << "Failed: " << endpointString << std::endl;
}
else
{
std::cout << "Succeeded: " << endpointString << " = " << tempEndpoint._host << " / " << tempEndpoint._port << std::endl;
}
}
return 0;
}
The grammar successfully parses or fails to parse all the examples that it should parse or fail to parse per RFC rules, so that great. The problem is that whatever part is the "last" part of the hostname gets "doubled." Here's the output of running that:
Succeeded: 0foo = 0foo0foo / 0
Succeeded: foo = foofoo / 0
Succeeded: foo.net = foo.netnet / 0
Succeeded: foo:1234 = foofoo / 1234
Succeeded: foo.net:5678 = foo.netnet / 5678
Succeeded: foo.example.net = foo.example.netnet / 0
Succeeded: foo.example.net:9012 = foo.example.netnet / 9012
Succeeded: foo-bar = foo-barfoo-bar / 0
Succeeded: foo-bar.com = foo-bar.comcom / 0
Succeeded: foo-bar:1234 = foo-barfoo-bar / 1234
Succeeded: foo-bar.net-0:5678 = foo-bar.net-0net-0 / 5678
Succeeded: foo-bar.example.com-1:9012 = foo-bar.example.com-1com-1 / 9012
Failed: foo.0bar
Failed: foo.0bar:1234
Failed: foo.bar-
Failed: foo.bar-:1234
Failed: foo-
Failed: -foo
Note "foofoo", "foo.netnet", "foo.example.netnet", etc.
I've tried about 300 different variations of these rules—far too many to include them all here. Suffice it to say that many variations successfully identify valid vs invalid hostnames, but all those that do so correctly also suffer from the same duplicated last-part problem. I'm all out of ideas over here. Anyone know why this doesn't work and how to fix it?
Note that I've even tried the simplest possible alphabetic-only rule to achieve what I'm trying to achieve in order to expand on that rule with my more-complicated character sets, but even this duplicates "foo.net" into "foo.netnet":
start %=
+( +alpha >> char_( '.' ) ) >> +alpha
;