I've got a file whose format I'm altering via a Python script. The file contains several camel-cased strings where I just want to insert a single space before each capital letter, so "WordWordWord" becomes "Word Word Word". But the text also contains abbreviations, as in "General Manager or VP".

I found an answer from David Underhill in this post:

A pythonic way to insert a space before capital letters

That answer avoids inserting spaces inside abbreviations embedded in the text, as in "DaveIsAFKRightNow!Cool", but it still inserts a space between V and P in "VP".

I only have 25 reputation points and cannot comment on the existing post, so I have no choice but to create a new post for this similar problem.

I am not that good at regex and can't figure out how to handle this situation.

I have tried this:

import re
re_outer = re.compile(r'([^A-Z ])([A-Z])')
re_inner = re.compile(r'(?<!^)([A-Z])([^A-Z])')
print(re_outer.sub(r'\1 \2', re_inner.sub(r' \1\2', 'DaveIsAFKRightNow!Cool')))

It gives me 'Dave Is AFK Right Now! Cool'

My text sample is this:

General Manager or VP Torrance, CARequired education

I want the output to be: General Manager or VP Torrance, CA Required education

The output I am getting is: General Manager or V P Torrance, CA Required education

Wiktor Stribiżew
Vaibhav Rathi

2 Answers

You may swap the order of the replacements: first insert a space before each uppercase letter that is preceded by a character other than an uppercase letter or a space, and then append a space after each run of one or more uppercase letters that starts a word and is followed by an uppercase letter and a lowercase letter:

import re
re_outer = re.compile(r'([^A-Z ])([A-Z])')
re_inner = re.compile(r'\b[A-Z]+(?=[A-Z][a-z])')
print(re_inner.sub(r'\g<0> ', re_outer.sub(r'\1 \2', 'DaveIsAFKRightNow!Cool')))
# => Dave Is AFK Right Now! Cool
print(re_inner.sub(r'\g<0> ', re_outer.sub(r'\1 \2', 'General Manager or VP Torrance, CARequired education'))) 
# => General Manager or VP Torrance, CA Required education

The \b[A-Z]+(?=[A-Z][a-z]) regex matches

  • \b - word boundary
  • [A-Z]+ - 1+ uppercase letters that are
  • (?=[A-Z][a-z]) - followed with an uppercase letter and a lowercase letter.

Note that \g<0> inserts the whole match in the replacement pattern.
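
The two passes above can be wrapped in a single function so they are always applied in the right order. This is only a minimal sketch: the name `split_camel_case` and the upper-case constant names are hypothetical and not part of the original code, while the patterns are the same as above.

```python
import re

# Pass 1: an uppercase letter preceded by a character that is
# neither an uppercase letter nor a space.
RE_OUTER = re.compile(r'([^A-Z ])([A-Z])')
# Pass 2: a run of uppercase letters at a word start that is directly
# followed by an uppercase letter and a lowercase letter.
RE_INNER = re.compile(r'\b[A-Z]+(?=[A-Z][a-z])')


def split_camel_case(text):
    """Hypothetical helper: run pass 1, then pass 2, on the text."""
    spaced = RE_OUTER.sub(r'\1 \2', text)
    return RE_INNER.sub(r'\g<0> ', spaced)


for sample in (
    'WordWordWord',
    'DaveIsAFKRightNow!Cool',
    'General Manager or VP Torrance, CARequired education',
):
    print(split_camel_case(sample))
# => Word Word Word
# => Dave Is AFK Right Now! Cool
# => General Manager or VP Torrance, CA Required education
```

"VP" stays intact because pass 2 only matches an uppercase run that is immediately followed by a capitalized word, and "VP" is followed by a space.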

Wiktor Stribiżew

As an alternative, you might use a single pattern with an alternation:

((?<=[^\W[A-Z])[A-Z]|(?<=\S)[A-Z](?=[a-z]))

In the replacement use a space followed by group 1:

 \1

Explanation

  • ( Capture group
    • (?<= Positive lookbehind, assert what is on the left is
      • [^\W[A-Z] Match a word character except for A-Z
    • ) Close positive lookbehind
    • [A-Z] Match A-Z
    • | Or
    • (?<=\S) Positive lookbehind, assert what is on the left is
    • [A-Z] Match A-Z
    • (?=[a-z]) Positive lookahead, assert what is on the right is a-z
  • ) Close capture group
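
The breakdown above can also be written with re.VERBOSE, which keeps each part next to its comment. This is a sketch rather than the answer's exact code: `PATTERN` and `add_spaces` are hypothetical names, and the class `[^\W[A-Z]` is written as `[^\WA-Z]` on the assumption that the literal `[` is redundant, since `\W` already covers it.

```python
import re

PATTERN = re.compile(r'''
    (                   # group 1: the uppercase letter that gets a space
        (?<=[^\WA-Z])   # preceded by a word character other than A-Z
        [A-Z]
      |                 # or
        (?<=\S)         # preceded by any non-whitespace character
        [A-Z]
        (?=[a-z])       # and followed by a lowercase letter
    )
''', re.VERBOSE)


def add_spaces(text):
    """Hypothetical helper: put a space before every match of group 1."""
    return PATTERN.sub(r' \1', text)


print(add_spaces('General Manager or VP Torrance, CARequired education'))
# => General Manager or VP Torrance, CA Required education
```

The "P" in "VP" matches neither branch: it is preceded by an uppercase letter, so the first branch fails, and it is followed by a space, so the second branch fails.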

For example

import re

strings = [
    "General Manager or VP Torrance, CARequired education",
    "WordWordWord",
    "DaveIsAFKRightNow!Cool"
]
pattern = re.compile(r'((?<=[^\W[A-Z])[A-Z]|(?<=\S)[A-Z](?=[a-z]))')

for s in strings:  # avoid shadowing the built-in str
    print(pattern.sub(r' \1', s))

Result

General Manager or VP Torrance, CA Required education
Word Word Word
Dave Is AFK Right Now! Cool
The fourth bird