
I had functioning code with working re.compile patterns (so I know the majority of the code is good). The data set has now changed slightly and I can't figure out the correct re.compile mask for the new search string.

These are my original working masks. The S011 mask, for example, looks for a line starting with S0 whose code field is 11:

V1=re.compile(r"^V0")    
S011=re.compile(r"^S0\w*\W* 11\b")
S012=re.compile(r"^S0\w*\W* 12\b")
CDP111=re.compile(r"^C0\w*\W*111\b")
CDP112=re.compile(r"^C0\w*\W*113\b")
CDP121=re.compile(r"^C0\w*\W*122\b")
CDP122=re.compile(r"^C0\w*\W*124\b")
T011=re.compile(r"^T0\w*\W*1 1\b")
T012=re.compile(r"^T0\w*\W*1 2\b")

This is the data they were targeting. There are various S0, T0 and C0 variants (not all shown here):

V002PA081       1    1001655114.94N0072241.09E 425969.37304794.3 344.61290000331
S002PA081       11   1001655111.95N0072236.07E 425903.27304703.3 344.61290000331
T002PA081       1 1  1001655035.28N0072141.99E 425188.07303586.1 344.61290000331
T002PA081       1 2  1001655034.63N0072144.42E 425218.37303565.1 344.61290000331
C002PA081       111  1001655111.64N0072235.08E 425890.47304694.0 344.61290000331
C002PA081       113  1001655111.40N0072235.96E 425901.37304686.3 344.61290000331

The new data looks like this. All the digits are in the same place; it's just the first string that is different:

V0EQ21309-128   1    1001600535.99N0023642.63E 478409.56662024.7 107.91520748348
S0EQ21309-128   11   1001600532.60N0023645.31E 478450.26661919.6   0.01520748348
T0EQ21309-128   1 1  1001600452.10N0023713.63E 478880.66660664.0   0.01520748348
T0EQ21309-128   1 2  1001600452.35N0023715.00E 478901.96660671.8   0.01520748348
C0EQ21309-128   111  1001600532.07N0023645.24E 478449.06661903.2   0.01520748348
C0EQ21309-128   113  1001600532.22N0023646.01E 478461.06661907.6   0.01520748348

The re.compile for V0 works because it is only looking for V0. The others fail to find a target, I assume because the mask isn't correctly dealing with the '-128' part. The ending will not always be '-128'; it can be any variation on '-nnn'.

The various results need to go into specific lists, so I can't simply search for a line starting with 'C', as there are 4 variants.

Konrad Rudolph
WillH
  • So, add `(?:-\d+)?` after `\w*` to match an optional `-` and one or more digits. – Wiktor Stribiżew Jun 01 '21 at 12:21
  • Is there a reason why you are using ``re`` in the first place? This looks like a strictly columnar format, e.g. the name is the first 16 characters, the subtag in the next 4 characters, the location the next 25 characters and so on. Splitting by characters into these columns, and then just checking e.g. whether "name starts with ``S0`` and tag is ``11``" is very straightforward and much more robust than convoluting columns and content. – MisterMiyagi Jun 01 '21 at 12:31
  • @MisterMiyagi this is just an excerpt from the whole document. There are full-width text strings, and further on the data is split into smaller repeating sections. The whole pattern then repeats again and again, so it would take a long time to split out the various groups of different column formats. – WillH Jun 01 '21 at 13:42

1 Answer


It seems you can add the (?:-\d+)? part to each of your regexes right after \w*, except V1, since it only checks for V0 at the start of the string:

V1=re.compile(r"^V0")    
S011=re.compile(r"^S0\w*(?:-\d+)?\W* 11\b")
S012=re.compile(r"^S0\w*(?:-\d+)?\W* 12\b")
CDP111=re.compile(r"^C0\w*(?:-\d+)?\W*111\b")
CDP112=re.compile(r"^C0\w*(?:-\d+)?\W*113\b")
CDP121=re.compile(r"^C0\w*(?:-\d+)?\W*122\b")
CDP122=re.compile(r"^C0\w*(?:-\d+)?\W*124\b")
T011=re.compile(r"^T0\w*(?:-\d+)?\W*1 1\b")
T012=re.compile(r"^T0\w*(?:-\d+)?\W*1 2\b")

See an example S011 regex demo.

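The (?:-\d+)? part is a non-capturing group that matches an optional sequence of - and then one or more digits. Without it, \w* stops at the hyphen, \W* can consume only the hyphen itself, and the digits that follow (128) keep the rest of the pattern from matching.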
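
As a quick check, here is a minimal sketch that runs a few of the patterns above against sample lines copied from the question, in both the old and the new format. The names S011_OLD, PATTERNS and sort_lines are hypothetical helpers introduced only for this illustration; how the lines are actually collected into lists in the original program is not shown in the question, so the grouping below is an assumption.

```python
import re

# Patterns from the answer: an optional "-nnn" suffix is allowed after the identifier.
S011 = re.compile(r"^S0\w*(?:-\d+)?\W* 11\b")
T011 = re.compile(r"^T0\w*(?:-\d+)?\W*1 1\b")
T012 = re.compile(r"^T0\w*(?:-\d+)?\W*1 2\b")
CDP111 = re.compile(r"^C0\w*(?:-\d+)?\W*111\b")

# Original S011 pattern, kept only to show that it misses the new format.
S011_OLD = re.compile(r"^S0\w*\W* 11\b")

# Sample lines copied from the question (old format, then new format).
old_lines = [
    "V002PA081       1    1001655114.94N0072241.09E 425969.37304794.3 344.61290000331",
    "S002PA081       11   1001655111.95N0072236.07E 425903.27304703.3 344.61290000331",
    "T002PA081       1 1  1001655035.28N0072141.99E 425188.07303586.1 344.61290000331",
    "T002PA081       1 2  1001655034.63N0072144.42E 425218.37303565.1 344.61290000331",
    "C002PA081       111  1001655111.64N0072235.08E 425890.47304694.0 344.61290000331",
    "C002PA081       113  1001655111.40N0072235.96E 425901.37304686.3 344.61290000331",
]
new_lines = [
    "V0EQ21309-128   1    1001600535.99N0023642.63E 478409.56662024.7 107.91520748348",
    "S0EQ21309-128   11   1001600532.60N0023645.31E 478450.26661919.6   0.01520748348",
    "T0EQ21309-128   1 1  1001600452.10N0023713.63E 478880.66660664.0   0.01520748348",
    "T0EQ21309-128   1 2  1001600452.35N0023715.00E 478901.96660671.8   0.01520748348",
    "C0EQ21309-128   111  1001600532.07N0023645.24E 478449.06661903.2   0.01520748348",
    "C0EQ21309-128   113  1001600532.22N0023646.01E 478461.06661907.6   0.01520748348",
]

# Hypothetical mapping from list name to pattern (only a subset of the masks).
PATTERNS = {"S011": S011, "T011": T011, "T012": T012, "CDP111": CDP111}


def sort_lines(lines):
    """Put each line into the list of the first pattern it matches."""
    groups = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.match(line):
                groups[name].append(line)
                break  # a line belongs to at most one list
    return groups


groups = sort_lines(old_lines + new_lines)
for name, matched in groups.items():
    print(name, len(matched))
```

Lines that match none of the listed patterns (here the V0 and C0 ... 113 lines) are simply left out, since the sketch only includes four of the masks.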

Wiktor Stribiżew
  • @WiktorStribizw. That works perfectly and now handles both the old and new data format. Thanks also for the link to the regex demo, very useful. – WillH Jun 01 '21 at 13:24