I am new to writing parsers. I am trying to write a Boost.Spirit X3 parser that extracts US ZIP codes from input text. The parser patterns below do most of what I want: they match 5-digit ZIP codes and 9-digit ZIP+4 codes (90210-1234) as expected.
However, they also match the last five digits of a longer run of digits, which I want to avoid. For example:
246764 (returns 46764)
578397 (returns 78397)
I wanted to specify some anchoring conditions for the right and left of the above pattern, in the hopes that I could eliminate the examples above. More specifically, I want to prohibit matching when digits or dashes are adjacent to the beginning or end of the candidate zip code.
Test data (only 12345 and 15222-1825 should be matched):
12345
foo456
ba58r
246764anc
578397
90210-
15206-1
15222-1825
15212-4267-53410-2807
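To make the rule I am after concrete, here is a minimal sketch of it written without Spirit. It is not a parser solution, only a hand-written scan I would use to check results against. The helper names (isDashOrDigit, isZipShape, expectedZipCodes) are made up for this illustration, and it assumes that the start and end of the input count as valid boundaries. The idea is that a candidate is acceptable only when the maximal run of digits and dashes around it is exactly ddddd or ddddd-dddd.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: true for characters that must not touch a ZIP code.
bool isDashOrDigit(char c)
{
    return c == '-' || std::isdigit(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: true if run is exactly "ddddd" or "ddddd-dddd".
bool isZipShape(std::string const& run)
{
    if (run.size() != 5 && run.size() != 10)
        return false;
    for (std::size_t k = 0; k < run.size(); ++k)
    {
        bool const isDigit = std::isdigit(static_cast<unsigned char>(run[k])) != 0;
        // Position 5 must be the dash, every other position a digit.
        if (k == 5 ? run[k] != '-' : !isDigit)
            return false;
    }
    return true;
}

// Splits the input into maximal runs of digits and dashes and keeps the runs
// that have the shape of a ZIP or ZIP+4 code. Because each run is maximal,
// a kept run never has a digit or dash next to it.
std::vector<std::string> expectedZipCodes(std::string const& input)
{
    std::vector<std::string> result;
    std::size_t i = 0;
    while (i < input.size())
    {
        if (!isDashOrDigit(input[i]))
        {
            ++i;
            continue;
        }
        std::size_t j = i;
        while (j < input.size() && isDashOrDigit(input[j]))
            ++j;
        std::string const run = input.substr(i, j - i);
        if (isZipShape(run))
            result.push_back(run);
        i = j;
    }
    return result;
}
```

Calling expectedZipCodes on the test data above (one entry per line) should return only 12345 and 15222-1825, which is the behaviour I would like the X3 grammar to reproduce.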
Full code:
#include <boost/fusion/include/at_c.hpp>
#include <boost/fusion/include/vector.hpp>
#include <boost/spirit/home/x3.hpp>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

using It = std::string::const_iterator;
using ZipCode = boost::fusion::vector<It, It>;

// Let X3 store the iterator range exposed by raw[] in a ZipCode.
namespace boost { namespace spirit { namespace x3 { namespace traits {
    template <>
    void move_to<It, ZipCode>(It b, It e, ZipCode& z)
    {
        z =
        {
            b,
            e
        };
    }}}}}

void Parse(std::string const& input)
{
    auto start = std::begin(input);
    auto begin = start;
    auto end = std::end(input);

    ZipCode current;
    std::vector<ZipCode> matches;

    auto const fiveDigits = boost::spirit::x3::repeat(5)[boost::spirit::x3::digit];
    auto const fourDigits = boost::spirit::x3::repeat(4)[boost::spirit::x3::digit];
    auto const dash = boost::spirit::x3::char_('-');
    auto const notDashOrDigit = boost::spirit::x3::char_ - (dash | boost::spirit::x3::digit);

    // &p is the and-predicate: it tests p without consuming input.
    auto const zipCode59 =
        boost::spirit::x3::lexeme
        [
            -(&notDashOrDigit) >>
            boost::spirit::x3::raw[fiveDigits >> -(dash >> fourDigits)] >>
            &notDashOrDigit
        ];

    while (begin != end)
    {
        // On failure, retry from the next character; on success, phrase_parse
        // has already advanced begin past the match.
        if (!boost::spirit::x3::phrase_parse(begin, end, zipCode59, boost::spirit::x3::blank, current))
        {
            ++begin;
        }
        else
        {
            auto startOffset = std::distance(start, boost::fusion::at_c<0>(current));
            auto endOffset = std::distance(start, boost::fusion::at_c<1>(current));
            auto length = std::distance(boost::fusion::at_c<0>(current), boost::fusion::at_c<1>(current));
            std::cout << "Matched (\"" << startOffset
                      << "\", \""
                      << endOffset
                      << "\") => \""
                      << input.substr(startOffset, length)
                      << "\""
                      << std::endl;
        }
    }
}
This code with the above test data produces the following output:
Matched ("0", "5") => "12345"
Matched ("29", "34") => "46764"
Matched ("42", "47") => "78397"
Matched ("68", "78") => "15222-1825"
If I change zipCode59 to the following, I get no hits back:
auto const zipCode59 =
    boost::spirit::x3::lexeme
    [
        &notDashOrDigit >>
        boost::spirit::x3::raw[fiveDigits >> -(dash >> fourDigits)] >>
        &notDashOrDigit
    ];
I have read through this question: Stop X3 symbols from matching substrings. However, that question makes use of a symbol table, which I don't think can work for me because I cannot specify the ZIP codes as hard-coded strings. I'm also unclear as to how the answer to that question manages to reject a match when unwanted content comes directly before it.