How to Create Variable Definitions from Phrases using Regexes (Notepad++)

Question

Suppose we have a list of phrases - words separated by spaces. And suppose we want to define a bunch of variables based on these phrases such that the following hold:

Phrases already exist and are surrounded by quotes (if not, you can easily use a regex to achieve this)
Phrases only contain letters (this actually isn't true for me in practice, but I can handle those cases manually)
Variable name, followed by an equals sign, should precede the phrase
Variable name should be a lowerCamelCase version of the phrase

Example

Input

"hello World"
"foo bAr"

Expected Output

helloWorld = "hello World"
fooBar = "foo bAr"

Use Case

Often in my line of work I am presented with a bunch of constants which come from an Excel spreadsheet and I need to define a bunch of variables in code for them. The phrases have spaces in them, but the variables can't. I'd usually like to keep the variable names as close to the phrases as I can. I'd like a way to do it in bulk, without having to individually type out each variable name.

Notes

I have come up with a way to do this, which I want to record here in case I need it in future and in case others might need it. I also want to post it here because I have a feeling there are optimizations that can be made to my process, or at least alternatives.

Although you've shared the context/use case, making it not quite an x-y problem, are you sure a bunch of generated variables is a solution that makes sense for this scenario? It seems like a data structure of key-value pairs or arrays might be more appropriate. In other words, decide what the data is _about_, then collect it accordingly. To take a contrived example, if all of the strings describe client names, for example, importing the Excel data using a library and storing it in an array of `clients = []` seems far more practical than `janeDoe = "Jane Doe"; johnDoe = "John Doe"` (+10K more). — ggorlen, Feb 18 '20 at 18:34
Yes you are correct - in that case we'd be talking about data, not variables. The use case for variables applies when I'm taking it off headers. Some Excel sheets are like poor databases. They have a bunch of records in the rows and then column headers. The above would apply to the headers, but not the records as you astutely observe. — Colm Bhandal, Feb 18 '20 at 18:44
Even here, I don't see a case when this would be appropriate. If the headers were `"foo bar"` and `"baz quux"`, it'd be most normal to create a data structure like `{"foo bar": [...all the foo bars...], "baz quux": [...all the baz quuxes...]}` rather than `fooBar = [...all the foo bars...]` and `bazQuux = [...all the baz quuxes...]`. [Relevant canonical answer](https://stackoverflow.com/questions/1373164/how-do-i-create-a-variable-number-of-variables) and [JS version](https://stackoverflow.com/questions/5187530/variable-variables-in-javascript). — ggorlen, Feb 18 '20 at 18:51
Yes, that's perfectly valid to do. We have code that does this- wraps the data into dictionaries for easy lookup. But we still want variables to do the lookup so that if we do the same lookup in multiple places we don't have to duplicate a string constant in code. — Colm Bhandal, Feb 18 '20 at 18:56
That also seems like an antipattern--I don't see the harm in using a string constant to key into a dict. But I don't mean to argue, I'm glad you solved a problem with this. — ggorlen, Feb 18 '20 at 18:57
No worries. It's messy for sure. I appreciate your feedback on this. I don't see any way to do this while respecting the DRY principle without creating the variables in code. — Colm Bhandal, Feb 18 '20 at 18:59
I don't think DRY applies to keying into a data structure. If the key name is `foo`, I see no harm in saying `data_structure.foo` or `data_structure["foo"]`. Variable keys are useful when the key is, well, actually variable. Something like `const fooKey = "foo";` and later on in the code `data_structure[fooKey]` feels like taking the advice to such an extreme that it actually hurts readability. — ggorlen, Feb 18 '20 at 19:08
I disagree. With the amount of headers that we have to manage, we wouldn't want to be typing in raw strings every time and hoping that they're right. Or copying and pasting them in from Excel every time we need them. This is faster. It provides us with objects with hundreds of properties that we can navigate through easily with Intellisense and it gives us the security of the whole thing being more typesafe at compile time. And it's actually easier to read in many cases, because our cleaned UpperCamelCase variables often eliminate noisy characters from headers. — Colm Bhandal, Feb 21 '20 at 17:44
So maybe I was wrong to quote DRY as the primary reason. Although it is an added benefit: we only define the variable once, and can, in theory, use it multiple times. — Colm Bhandal, Feb 21 '20 at 17:45

Colm Bhandal · Answer 1 · 2021-03-09T15:05:12.003

I haven't found a single find/replace step that'll do everything for you, but I have managed to do it using a sequence of regexes, applied one after another. The first pulls out the content for the variable name and inserts the "=". The next one does the main heavy lifting and removes spaces and applies the correct casing. The final one ensures all variables begin with lowercase letters. Apply them in sequence to achieve the desired result.

Regex #1

All we're doing here is pulling content out of the quotes and inserting it on the left hand side.

Note: here and below, I use the character ␣ to stand for a space, because SO doesn't render leading spaces correctly. Replace ␣ with an actual space when you use this or any other regex in this answer.

Find: "(.+)"

Replace: ␣\1 = "\1" (note the leading space)

In our example, after this step, we end up with:

hello World = "hello World"
foo bAr = "foo bAr"
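
For anyone who wants to check this step outside Notepad++, here is a minimal Python sketch of the same find/replace. The function name `insert_variable_stub` is made up for illustration; the pattern and replacement are the ones given above, and Python's `re.sub` happens to accept the same `\1` backreference syntax.

```python
import re

def insert_variable_stub(text: str) -> str:
    """Copy each quoted phrase to the left of an equals sign (Regex #1)."""
    # The leading space in the replacement is deliberate: it gives the first
    # word a preceding space, so the next step can treat every word alike.
    return re.sub(r'"(.+)"', r' \1 = "\1"', text)

step1 = insert_variable_stub('"hello World"\n"foo bAr"')
print(step1)
```

Because `.` does not match a newline here, each quoted phrase is handled on its own line.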

Regex #2

Here, we want to match each word on the left hand side, with the goal of removing the whitespace and simultaneously fixing the casing.

Find: ␣(\S)(\S+)(?=.*=) (note the leading space)

Replace: \u\1\L\2 (absence of space in replacement pattern achieves the removal of the space)

After this step, we end up with:

HelloWorld = "hello World"
FooBar = "foo bAr"

Correct, except for the first letter of each variable name.
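
The case-changing escapes `\u` and `\L` are not available in every regex engine. As a rough illustration of what they do, here is a Python sketch of this step; Python's `re` module has no such escapes, so a callback (the hypothetical `fix_word`) does the casing instead. The find pattern is unchanged from the one above.

```python
import re

def camel_case_left_side(text: str) -> str:
    """Drop the space before each word left of '=' and fix its casing (Regex #2)."""
    def fix_word(match: re.Match) -> str:
        # \u\1 : upper-case the first character of the word.
        # \L\2 : lower-case the remaining characters.
        # The matched leading space is simply not written back.
        return match.group(1).upper() + match.group(2).lower()

    return re.sub(r' (\S)(\S+)(?=.*=)', fix_word, text)

step2 = camel_case_left_side(' hello World = "hello World"\n foo bAr = "foo bAr"')
print(step2)
```

One thing the sketch makes easy to test: since `(\S+)` needs at least one character, a one-letter word such as `a` is not matched and keeps its leading space.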

Regex #3

This fixes the leading characters to be lowercase:

Find: ^(.)

Replace: \l\1

After this step, our output is as desired:

helloWorld = "hello World"
fooBar = "foo bAr"
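
The same step can be sketched in Python, again with a callback standing in for `\l`. The function name is hypothetical, and `re.MULTILINE` is passed so that `^` matches at the start of every line, mirroring the per-line result shown above.

```python
import re

def lowercase_first_char(text: str) -> str:
    """Lower-case the first character of every line (Regex #3)."""
    # Without re.MULTILINE, ^ would only match at the very start of the text.
    return re.sub(r'^(.)', lambda m: m.group(1).lower(), text, flags=re.MULTILINE)

step3 = lowercase_first_char('HelloWorld = "hello World"\nFooBar = "foo bAr"')
print(step3)
```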

Optional Regex #4 (Remove Invalid Characters)

Though the requirement assumed all letters, this is often not the case. First, you may want some numbers in there. Second, there may be junk like parentheses. In this case, just do a find/replace with the replace expression empty for the following find expression:

[^\w\r\n](?!=)(?=.*=)

The character class matches any single character that is not a letter, a digit, an underscore or an end-of-line character. The two lookaheads then require that the match is followed by an = somewhere later on the line, but not immediately followed by one, so the space before the = is preserved.
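
To see which characters this expression removes, here is a small Python sketch using the same pattern with an empty replacement. The input line is an assumed example: it is what the first three steps would leave behind for the phrase "foo bAr (2)".

```python
import re

def strip_invalid_chars(text: str) -> str:
    """Delete non-word characters left of the '=', keeping the space before it."""
    # [^\w\r\n] : any character that is not a letter, digit, underscore or EOL
    # (?!=)     : ...that is not directly followed by '='
    # (?=.*=)   : ...and that has an '=' somewhere later on the same line
    return re.sub(r'[^\w\r\n](?!=)(?=.*=)', '', text)

cleaned = strip_invalid_chars('fooBar(2) = "foo bAr (2)"')
print(cleaned)
```

The parentheses in the variable name are dropped while the quoted phrase is left alone, because nothing after the = has another = ahead of it. The flip side is that a phrase which itself contains an = would satisfy the lookahead too, so such lines are probably best handled by hand.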

As a Macro

Rather than manually do all 4 steps above, you can record them as a macro and save it to your Notepad++. Or just paste the XML below inside the <Macros> XML element in the file shortcuts.xml inside %appdata%\Notepad++. If you do paste, the shortcut is ctrl+alt+shift+V, but you can change that to whatever you want:

<Macro name="DefineVariables" Ctrl="yes" Alt="yes" Shift="yes" Key="86">
    <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
    <Action type="3" message="1601" wParam="0" lParam="0" sParam="&quot;(.+)&quot;" />
    <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
    <Action type="3" message="1602" wParam="0" lParam="0" sParam=' \1 = &quot;\1&quot;' />
    <Action type="3" message="1702" wParam="0" lParam="768" sParam="" />
    <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
    <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
    <Action type="3" message="1601" wParam="0" lParam="0" sParam=" (\S)(\S+)(?=.*=)" />
    <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
    <Action type="3" message="1602" wParam="0" lParam="0" sParam="\u\1\L\2" />
    <Action type="3" message="1702" wParam="0" lParam="768" sParam="" />
    <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
    <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
    <Action type="3" message="1601" wParam="0" lParam="0" sParam="^(.)" />
    <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
    <Action type="3" message="1602" wParam="0" lParam="0" sParam="\l\1" />
    <Action type="3" message="1702" wParam="0" lParam="768" sParam="" />
    <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
    <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
    <Action type="3" message="1601" wParam="0" lParam="0" sParam="[^\w\r\n](?!=)(?=.*=)" />
    <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
    <Action type="3" message="1602" wParam="0" lParam="0" sParam="" />
    <Action type="3" message="1702" wParam="0" lParam="768" sParam="" />
    <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
</Macro>

FYI `[A-z]` matches more than just letters. Have a look at an [ASCII table](https://www.ascii-code.com/). — Toto, Feb 18 '20 at 19:16
Ah... thanks Toto... so I guess its negation will match less than we want. What would be a match for just a letter... \w? — Colm Bhandal, Feb 18 '20 at 19:48
`\w` is digits and underscores in addition to upper and lowercase letters. If you want a letter use `[A-Za-z]`. — ggorlen, Feb 19 '20 at 01:15