
TL;DR

I want to transform a string (representing a regex) like "\\." into "\." in a clean and resilient way. Something akin to sed 's/\\\\/\\/g' would do, though I don't know whether that breaks on edge cases.
val.decode('string-escape') is not an option since I'm using Python 3.

What I tried so far:

  • variations of val.replace('\\\\', '\\')
  • looked at the answers to these two questions but couldn't get them to work in my case
    • variations of val.encode().decode('unicode-escape')
  • had a look at the docs for strings but couldn't find a solution

I am sure that I missed a relevant part, because string escaping (and unescaping) seems like a fairly common and basic problem, but I haven't found a solution yet =/

Full Story:

I have a YAML file like this:

- !Scheme
      barcode: _([ACGTacgt]+)[_.]
      lane: _L(\d\d\d)[_.]
      name: RKI
      read: _R(\d)+[_.]
      sample_name: ^(.+)(?:_.+){5}
      set: _S(\d+)[_.]
      user: _U([a-zA-Z0-9\-]+)[_.]
      validation: .*/(?:[a-zA-Z0-9\-]+_)+(?:[a-zA-Z0-9])+\.fastq.*
...

that describes a "Scheme" object. The 'name' key is an identifier; all the other keys hold regexes.

I want to be able to parse an object from that YAML, so I wrote a from_yaml class method:

scheme = Scheme()
loaded_mapping = loader.construct_mapping(node)  # load yaml-node as dictionary WARNING! loads str escaped

# re.compile all keys except name, adding name as regular string and
# unescaping escaped sequences (like '\') in the process
for key, val in loaded_mapping.items():
    if key == 'name':
        processed_val = val
    else:
        processed_val = re.compile(val)  # backslashes in val are escaped
    scheme.__dict__[key] = processed_val

The problem is that loader.construct_mapping(node) loads the strings with backslashes escaped, so the regex is not correct anymore.

I tried several variations of val.encode().decode('unicode-escape') and val.replace('\\\\', '\\'), but had no luck with either.

If anyone has an idea how to handle this I'd appreciate it very much! I am not married to this specific way of doing things and open to alternative approaches.

Kind Regards!

Fynn
  • The question is unclear. When you have a string literal `"\\."` then this **is** the string `\.`. What do you mean by "represent"? – Tomalak Mar 06 '20 at 18:52
  • by "represent" I mean that this should ultimately get `re.compile()`ed – Fynn Mar 07 '20 at 19:22

1 Answer


Assuming I have this super simple YAML file

lane: _L(\d\d\d)[_.]

and load it with PyYAML like this:

import yaml
import re

with open('test.yaml', 'rb') as stream:
    data = yaml.safe_load(stream)

lane_pattern = data['lane']
print(lane_pattern)

lane_expr = re.compile(data['lane'])
print(lane_expr)

Then the result is exactly as one would expect:

_L(\d\d\d)[_.]
re.compile('_L(\\d\\d\\d)[_.]')

There is no double escaping of strings going on when YAML is parsed, so there is nothing for you to unescape.
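The doubled backslashes in the second line of output come from how Python displays a string as a literal (its repr), not from the string's contents. Here is a minimal sketch of that distinction using only the standard library. The pattern is written inline instead of being loaded from YAML, and the file name is made up for illustration:

```python
import re

# The pattern exactly as it appears in the YAML file (raw string: single backslashes).
lane_pattern = r"_L(\d\d\d)[_.]"

# An ordinary literal with doubled backslashes denotes the very same string.
assert lane_pattern == "_L(\\d\\d\\d)[_.]"

print(lane_pattern)        # the actual characters: _L(\d\d\d)[_.]
print(repr(lane_pattern))  # the literal form, backslashes doubled for display
print(len(r"\d"))          # 2: one backslash plus the letter d

lane_expr = re.compile(lane_pattern)

# "sample_L001_R1.fastq" is a hypothetical file name, not taken from the question.
match = lane_expr.search("sample_L001_R1.fastq")
print(match.group(1))      # 001
```

If print(val) shows single backslashes, the string is already what re.compile() needs, whatever a debugger or repr() shows.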

Tomalak
  • my bad, you're right, I just messed up the actual regex! thanks for your trouble ^^' – Fynn Mar 07 '20 at 19:22
  • @Fynn I thought so, but suspected that some code you did not write might be responsible, hence the minimum attempt to reproduce. – Tomalak Mar 07 '20 at 19:41