I'm trying to build a complicated regex. I want to match a string with the following structure:

  1. .+ (any character, at least once)
  2. either "del" or "ins" or "dup" or [ATGC]
  3. .* (the string ends or is followed by anything)

I have tried different things. This is my current attempt, which doesn't work:

import re
hgvs = "c.*1017delT"
a = re.match('(.*)(del|ins|dup|[ATGC]).*', hgvs)
a.groups()
# output: ('c.*1017del', 'T')

I expect "(.*)" to capture everything before the "del". But the regex seems to prefer the [ATGC] alternative over the "del" alternative.

Robert

1 Answer

Try a non-greedy match:

re.match('(.*?)(del|ins|dup|[ATGC]).*', hgvs)
             ^

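With the non-greedy qualifier, the first group (.*?) matches as few characters as possible, so the second group gets to match at the earliest position in the string ("del") instead of the last one ("T").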
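To see the difference side by side, here is a minimal sketch that runs both patterns on the string from the question. The variable names greedy, lazy and strict are only illustrative, and the .+? variant is an assumption on my part, added to honor the "at least once" requirement in the question.

```python
import re

hgvs = "c.*1017delT"

# Greedy: (.*) grabs as much as possible, then backtracks only until
# the second group can match, which first happens at the final "T".
greedy = re.match(r'(.*)(del|ins|dup|[ATGC]).*', hgvs)
print(greedy.groups())  # ('c.*1017del', 'T')

# Lazy: (.*?) starts empty and grows one character at a time, so the
# second group matches at the earliest possible position, here "del".
lazy = re.match(r'(.*?)(del|ins|dup|[ATGC]).*', hgvs)
print(lazy.groups())  # ('c.*1017', 'del')

# Assumption: .+? instead of .*? requires at least one leading character,
# as described in step 1 of the question.
strict = re.match(r'(.+?)(del|ins|dup|[ATGC]).*', hgvs)
print(strict.groups())  # ('c.*1017', 'del')
```

Note that the lazy version stops at the first place where any of the alternatives matches, so a string with an A, T, G or C before the "del" would be split at that letter instead.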

P.S. Once you learn more about regex, you won't consider this one "complex"; there is far more intricate regex syntax out there.

iBug