
I am having some trouble figuring out how to use regular expressions in Python. Ultimately I am trying to do what sscanf does for me in C.

I am trying to match strings that look like this:

12345_arbitrarystring_2020_05_20_10_10_10.dat

I seem to be able to validate this format by calling match on the following regular expression:

regex = re.compile('[0-9]{5}_.+_[0-9]{4}([-_])[0-9]{2}([-_])[0-9]{2}([-_])[0-9]{2}([:_])[0-9]{2}([:_])[0-9]{2}\\.dat')

(Note that I do allow for a few other separators than just '_'.)

I would like to split the given string on these separators so I do:

regex = re.compile('[_\\-:.]+')
parts = regex.split(given_string) 

This all works. The problem is that I would like my 'arbitrarystring' part to be allowed to contain '-' and '_', and the split above breaks it apart at those characters.

Other than manually cutting the timestamp and the first 5 digits off that given string, what can I do to get that arbitrarystring part?

Lieuwe

2 Answers


You could use a capturing group to get the arbitrarystring part and omit the other capturing groups.

For example, you could use the character class [\w-]+ to match one or more word characters or hyphens. Note that \w also matches the underscore.

If you still want to use split, you could add capturing groups around the parts you want to break up, and split only what those groups matched.

^[0-9]{5}_([\w-]+)_[0-9]{4}[-_][0-9]{2}[-_][0-9]{2}[-_][0-9]{2}[:_][0-9]{2}[:_][0-9]{2}\.dat$
          ^^^^^^^^
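As a rough sketch of how this could look in Python: the pattern below is the one above with two extra capturing groups (around the five digits and around the timestamp), which is my own assumption about what "split only those groups" could mean. The names PATTERN and parse_filename are hypothetical.

```python
import re

# Hypothetical names. The regex follows the answer's pattern, with extra
# groups around the leading digits and the timestamp (an assumption).
PATTERN = re.compile(
    r'^([0-9]{5})_([\w-]+)_'
    r'([0-9]{4}[-_][0-9]{2}[-_][0-9]{2}[-_][0-9]{2}[:_][0-9]{2}[:_][0-9]{2})'
    r'\.dat$'
)

def parse_filename(filename):
    """Return [id, name, year, month, day, hour, minute, second] or None."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    ident, name, stamp = m.groups()
    # Split only the timestamp, so '-' and '_' inside the name are untouched.
    return [ident, name] + re.split(r'[-_:]', stamp)

print(parse_filename('12345_arbitrary-string_x_2020_05_20_10_10_10.dat'))
# ['12345', 'arbitrary-string_x', '2020', '05', '20', '10', '10', '10']
```

Because the timestamp has a fixed shape and the pattern is anchored with $, the greedy [\w-]+ backtracks to exactly the point where the timestamp starts.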


The fourth bird

It seems possible to shorten your validation regex to:

^\d{5}_(.+?)_\d{4}[-_](?:\d{2}[-_]){2}(?:\d{2}[:_]){2}\d{2}\.dat$

Refer to group 1 for your arbitrary string.
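A minimal sketch of using this pattern from Python, assuming you only need the middle part. PATTERN and parse_name are hypothetical names; the regex itself is copied from above. Since the group is `.+?`, it accepts any characters (not only word characters and hyphens) between the five digits and the timestamp.

```python
import re

# Pattern copied from this answer; PATTERN and parse_name are hypothetical names.
PATTERN = re.compile(
    r'^\d{5}_(.+?)_\d{4}[-_](?:\d{2}[-_]){2}(?:\d{2}[:_]){2}\d{2}\.dat$'
)

def parse_name(filename):
    """Return the arbitrary middle part, or None if the format is wrong."""
    m = PATTERN.match(filename)
    return m.group(1) if m else None

print(parse_name('12345_my-arbitrary_string_2020_05_20_10_10_10.dat'))
# my-arbitrary_string
print(parse_name('12345_missing_timestamp.dat'))
# None
```

In Python 3, \d in a str pattern also matches non-ASCII digits; passing re.ASCII to re.compile restricts it to 0-9 if that matters for your filenames.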



Quick reminder: you don't seem to have used raw strings, and instead escaped the backslash by doubling it. Python has raw strings (r'...'), so you no longer have to escape backslashes.
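To illustrate the raw-string point, here is a small sketch with a shortened, made-up pattern (not the full one from the question). Both literals produce the same string, so the regex engine sees the same pattern either way; the raw form is just easier to read.

```python
import re

# Shortened illustrative pattern; the variable names are hypothetical.
escaped = '[0-9]{5}_.+\\.dat'   # normal string: backslash must be doubled
raw = r'[0-9]{5}_.+\.dat'       # raw string: backslash written once

print(escaped == raw)
# True
print(bool(re.fullmatch(raw, '12345_anything.dat')))
# True
```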

JvdV