I am parsing multiline text records that look like the one below:
> UniRef50_A0A091LJV8 Lysozyme g (Fragment) n=2 Tax=Chlamydotis
macqueenii RepID=A0A091LJV8_9GRUI
Length=186
Score = 114 bits (285), Expect = 3e-30, Method: Compositional matrix adjust.
Identities = 54/83 (65%), Positives = 65/83 (78%), Gaps = 0/83 (0%)
Query 1 ASCKTAKPEGLSYCGVSASKKIAERDLQAMDRYKTIIKKVGEKLCVEPAVIAGIISRESH 60
AS TA+PEGLSY GVSAS+KIAE+DL+ M +++ I +V V+PA+IAGIISRESH
Sbjct 17 ASEATARPEGLSYAGVSASEKIAEKDLKNMQKHQDKITRVANSKGVDPALIAGIISRESH 76
Query 61 AGKVLKNGWGDRGNGFGLMQVDK 83
G VL+NGWGD N FGLMQVDK
Sbjct 77 GGTVLENGWGDHNNAFGLMQVDK 99
I use a few regular expressions to extract data from such records. All of them work when compiled with clang (Mac OS X). With gcc 4.9.2 (Ubuntu), however, one of them throws a regex_error when the std::regex object is constructed. Here is a minimal (non-)working example:
#include <regex>
#include <string>

const std::string regex_string_OK_1 = "\\[(.+?)\\]";
const std::string regex_string_OK_2 = "Tax\\s*?=\\s*?([\\n\\w ]*?)\\s*?RepID";
const std::string regex_string_PROBLEM = "Query\\s+?(\\d+?)\\s+?([_\\-[:alnum:]]+?)\\s+?(\\d+?)\\n.+?\\nSbjct\\s+?(\\d+?)\\s+?([_\\-[:alnum:]]+?)\\s+?(\\d+?)\\n";

int main(int argc, const char *argv[]) {
    std::regex regex_OK_1(regex_string_OK_1);
    std::regex regex_OK_2(regex_string_OK_2);
    std::regex regex_PROBLEM(regex_string_PROBLEM); // This line throws regex_error on Ubuntu
    return 0;
}
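To narrow the failure down, the exception can be caught and its error code inspected. This is only a minimal sketch: describe_regex is a hypothetical helper name introduced here for illustration, not part of my real code, and the numeric value of the error code is implementation-specific.

```cpp
#include <regex>
#include <string>

// Hypothetical diagnostic helper (name is an assumption, not from the real code).
// Tries to compile the pattern with the default ECMAScript grammar and returns
// "ok" on success, or a short description of the std::regex_error otherwise.
std::string describe_regex(const std::string& pattern) {
    try {
        std::regex re(pattern);  // construction is where regex_error is thrown
        return "ok";
    } catch (const std::regex_error& e) {
        // e.code() is a std::regex_constants::error_type; its integer value
        // differs between standard library implementations.
        return "regex_error code " +
               std::to_string(static_cast<int>(e.code())) + ": " + e.what();
    }
}
```

Calling describe_regex(regex_string_PROBLEM) on each platform and printing the result should show which error_type gcc's library reports. Passing shorter pieces of the pattern one at a time may help isolate the sub-expression it rejects.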
I tested all the regex strings with https://myregextester.com, and they work just fine. Also, the code compiled on Mac OS with clang parses lots of real-world data with no problems. But now I have to run the code on a Linux/gcc system.