
I am parsing multiline text records that look like the one below:

> UniRef50_A0A091LJV8 Lysozyme g (Fragment) n=2 Tax=Chlamydotis 
macqueenii RepID=A0A091LJV8_9GRUI
Length=186

 Score =   114 bits (285),  Expect = 3e-30, Method: Compositional matrix adjust.
 Identities = 54/83 (65%), Positives = 65/83 (78%), Gaps = 0/83 (0%)

Query  1   ASCKTAKPEGLSYCGVSASKKIAERDLQAMDRYKTIIKKVGEKLCVEPAVIAGIISRESH  60
       AS  TA+PEGLSY GVSAS+KIAE+DL+ M +++  I +V     V+PA+IAGIISRESH
Sbjct  17  ASEATARPEGLSYAGVSASEKIAEKDLKNMQKHQDKITRVANSKGVDPALIAGIISRESH  76

Query  61  AGKVLKNGWGDRGNGFGLMQVDK  83
            G VL+NGWGD  N FGLMQVDK
Sbjct  77  GGTVLENGWGDHNNAFGLMQVDK  99

I use a few regular expressions to extract data from such records. They all work when compiled with clang (MacOS X). With gcc 4.9.2 (Ubuntu), however, one of them throws a regex_error. Here is a minimal (non-)working example:

#include <regex>
#include <string>

const std::string regex_string_OK_1 = "\\[(.+?)\\]";
const std::string regex_string_OK_2 = "Tax\\s*?=\\s*?([\\n\\w ]*?)\\s*?RepID";
const std::string regex_string_PROBLEM = "Query\\s+?(\\d+?)\\s+?([_\\-[:alnum:]]+?)\\s+?(\\d+?)\\n.+?\\nSbjct\\s+?(\\d+?)\\s+?([_\\-[:alnum:]]+?)\\s+?(\\d+?)\\n";

int main(int argc, const char *argv[]) {

  // These two patterns compile with both clang and gcc.
  std::regex regex_OK_1(regex_string_OK_1);
  std::regex regex_OK_2(regex_string_OK_2);

  std::regex regex_PROBLEM(regex_string_PROBLEM); // This line throws regex_error on Ubuntu

  return 0;
}

I tested all the regex strings with https://myregextester.com, and they work just fine. Also, the code compiled on MacOS with clang parses lots of real-world data with no problems. But now I have to run the code on a Linux/gcc system.

tnorgd

1 Answer


I had to completely rewrite this answer. After testing your code on http://melpon.org/wandbox/ under various versions of clang and gcc, I am starting to think that gcc does not recognize \- as a valid escape for a hyphen inside a character class (in every version I tried).

Your example now looks correct to me: [_\\-[:alnum:]] already contains an escaped hyphen (\\-), but for some reason gcc does not like it. So I suggest the following character class, with the hyphen placed first so that it needs no escape:

 `[-_[:alnum:]]`

If you also need to match a backslash (\), add \\\\ to the character class (I previously assumed that was your intention).
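As a rough sketch of how the suggested character class fits into the full pattern from the question: the helper name `parse_alignment_block` and the choice to return the captures as a vector are my own assumptions for illustration, not part of the original code. The only change to the pattern itself is that `[_\\-[:alnum:]]` becomes `[-_[:alnum:]]`.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the question): returns the six captured
// fields of the first Query/Sbjct pair in `text`, or an empty vector if
// no pair is found.
std::vector<std::string> parse_alignment_block(const std::string& text) {
    // Same pattern as in the question, except that the hyphen is placed
    // first in each character class instead of being escaped.
    static const std::regex pattern(
        "Query\\s+?(\\d+?)\\s+?([-_[:alnum:]]+?)\\s+?(\\d+?)\\n"
        ".+?\\n"
        "Sbjct\\s+?(\\d+?)\\s+?([-_[:alnum:]]+?)\\s+?(\\d+?)\\n");

    std::vector<std::string> fields;
    std::smatch match;
    if (std::regex_search(text, match, pattern)) {
        // match[0] is the whole match; the capture groups start at index 1.
        for (std::size_t i = 1; i < match.size(); ++i) {
            fields.push_back(match[i].str());
        }
    }
    return fields;
}
```

Called on the second Query/Sbjct block of the sample record, this should yield the start position, sequence, and end position of the query line followed by the same three fields of the subject line.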

P.S. My previous answer left \\ in the character class, which in turn caused exceptions on clang. That regexp was incorrect anyway, because it ended up escaping the bracket (\[), which makes no sense. Why gcc did not complain about it, I do not know.
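Since the two standard libraries disagree on some patterns, it may help to probe each pattern on the target platform before relying on it. A minimal sketch, assuming a hypothetical helper named `regex_compiles` (not something from the question), which only reports whether construction throws:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reports whether the standard library in use
// accepts `pattern` (default ECMAScript grammar) without throwing.
bool regex_compiles(const std::string& pattern) {
    try {
        std::regex compiled(pattern);
        (void)compiled;  // only construction matters here
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

Running this over every pattern string at start-up on both clang and gcc builds would surface a rejected pattern early, rather than in the middle of parsing.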

marcinj
  • I accepted that answer because it fixed the problems with gcc. But now the code throws an exception on MacOS (clang)... – tnorgd Jun 08 '16 at 21:03
  • @tnorgd See my edit. You can also quickly test your code under various versions of gcc/clang here: http://melpon.org/wandbox/permlink/lg9fG7E5Yu2KqMeK – marcinj Jun 08 '16 at 21:39