1

I have the below method in Python (3.10):

import re

def camel_to_screaming_snake(value):
    return '_'.join(re.findall('[A-Z][a-z]+|[0-9A-Z]+(?=[A-Z][a-z])|[0-9A-Z]{2,}|[a-z0-9]{2,}|[a-zA-Z0-9]'
                                        , value)).upper()

The goal I am trying to accomplish is to insert an underscore every time the case in a string changes, or between a non-numeric character and a numeric character. However, the following cases being passed in are not yielding expected results. (left is what is passed in, right is what is returned)

vn3b -> VN3B
vnRbb250V -> VN_RBB_250V

I am expecting the following return values:

vn3b -> VN_3_B
vnRbb250V -> VN_RBB_250_V

What is wrong with my regex preventing this from working as expected?

work89
  • 75
  • 8
  • It's not clear why you expected otherwise. I'd recommend using a regex debugger, e.g. https://regex101.com/r/P0HdBq/1. – jonrsharpe Jan 25 '23 at 20:18
  • The reason why I expected otherwise is because I am unfamiliar with regex (and how to use the debugger you passed along, for that matter). I was hoping someone would be able to provide the correct regex with an explanation of where I went wrong. The very presence of this question indicates that. – work89 Jan 25 '23 at 20:25
  • Then you could have a look at e.g. https://stackoverflow.com/q/4736/3001761. A post for every possible use regex can be put to isn't the valuable Q&A SO is optimising for. – jonrsharpe Jan 25 '23 at 20:26

1 Answers1

2

You want to catch particular boundaries and insert an _ in them. To find a boundary, you can use the following regex format: (?<=<before>)(?=<after>). These are non capturing lookaheads and lookbehinds where before and after are replaced with the conditions.

In this case you have defined 3 ways that you might need to find a boundary:

  • A lowercase letter followed by an uppercase letter: (?<=[a-z])(?=[A-Z])
  • A number followed by a letter: (?<=[0-9])(?=[a-zA-Z])
  • A letter followed by a number: (?<=[a-zA-Z])(?=[0-9])

You can combine these into a single regex with an or | and then call sub on it to replace the boundaries. Afterwards you can uppercase the output:

import re

boundaries_re = re.compile(
    r"((?<=[a-z])(?=[A-Z]))"
    r"|((?<=[0-9])(?=[a-zA-Z]))"
    r"|((?<=[a-zA-Z])(?=[0-9]))"
)

def format_var(text):
    return boundaries_re.sub("_", text).upper()
>>> format_var("vn3b")
VN_3_B
>>> format_var("vnRbb250V")
VN_RBB_250_V
flakes
  • 21,558
  • 8
  • 41
  • 88
  • 1
    Perfect. I wasn't familiar with boundaries. Thanks for pointing out compiling in advance as well for improved performance. – work89 Jan 25 '23 at 20:40