2

I would like to apply .lower() to the words in a string that appear in a predefined list, but not to any other words. For instance, take the string below.

string1 = "ThE QuIcK BroWn foX jUmpEd oVer thE LaZY dOg."

Now say I have a list as seen below:

lower_list = ['quick', 'jumped', 'dog']

Ideally, I would call the function on the entire string, in the same way as this:

string1.lower()

But the function would apply .lower() only to the words in string1 that are in lower_list, giving the output below:

> ThE quick BroWn foX jumped oVer thE LaZY dog.

Can this be done in a simple manner? My idea was to use a for loop, but I need to retain the formatting of the string. For example, the string may span multiple lines, with indents on some lines and not others.

EDIT: I am getting the following error

parts[1::2] = (word.lower() for word in parts[1::2]) 
AttributeError: 'NoneType' object has no attribute 'lower'

I believe this might be due to having characters other than letters in the strings I use in lower_list. If I have a string like '(copy)' in the list, then I get the above error. Is there a way to get around this? I was thinking of converting every split part to a string using str(xxx), but I am not sure how to do that...

Chandler Cree

6 Answers

6

For this kind of problem you should be careful about cases like this one:

>>> phrase = 'the apothecary'
>>> phrase.replace('the', 'THE')
'THE apoTHEcary'

That is, you only want to replace whole-word matches. This is quite difficult to do with direct string manipulation, because a word boundary can be a space ' ' character, a full stop '.', or the start or end of the input string.

Fortunately, regexes make it easy to match whole words, because \b in a regex matches any word boundary. So we can solve the problem this way:

  • Create a regex which matches the words in lower_list, case-insensitive, but only when they have a word boundary before and after them.
  • Split the input string into parts using the regex, capturing the matches.
  • Transform each of the captured matches to lowercase.
  • Join the parts back again.

Because we're splitting on words rather than spaces, this means the original whitespace is preserved exactly. Here's an implementation:

import re

def lowercase_words(string, words):
    regex = r'\b(' + '|'.join(words) + r')\b'
    parts = re.split(regex, string, flags=re.IGNORECASE)
    parts[1::2] = (word.lower() for word in parts[1::2])
    return ''.join(parts)

Example:

>>> lowercase_words(string1, lower_list)
'ThE quick BroWn foX jumped oVer thE LaZY dog.'
>>> lowercase_words('ThE aPoThEcArY', ['the'])
'the aPoThEcArY'
>>> lowercase_words('  HELLO   \n WORLD  ', ['hello', 'world'])
'  hello   \n world  '
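
To see why parts[1::2] selects exactly the matched words, it may help to inspect what re.split returns when the pattern has a single capturing group: the captured matches land at the odd indices, and the untouched text between them lands at the even indices. A minimal sketch, using a shortened sample string and a two-word pattern chosen only for illustration:

```python
import re

# One capturing group, so re.split keeps each matched word in the result.
parts = re.split(r'\b(quick|dog)\b', 'ThE QuIcK BroWn dOg.', flags=re.IGNORECASE)

print(parts)        # ['ThE ', 'QuIcK', ' BroWn ', 'dOg', '.']
print(parts[1::2])  # ['QuIcK', 'dOg']  (the captured matches only)
```

Since the even-indexed pieces are never modified, any spaces, tabs, or newlines in them come back unchanged when the parts are joined.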

The above assumes that the words in lower_list only contain letters. If they might contain other characters, then there are two more problems:

  • We need to escape the special characters, with re.escape.
  • We only want to match word boundaries using \b if the word starts and/or ends with a letter.
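
The first problem is probably also the cause of the AttributeError mentioned in the question. Unescaped brackets in a word act as an extra capturing group, and re.split then emits None for that group whenever it did not take part in a match. A small sketch of this, with a word list and input string made up for illustration:

```python
import re

words = ['quick', 'dog', '(copy)']

# Without escaping, '(copy)' becomes a second, nested capturing group.
unescaped = r'\b(' + '|'.join(words) + r')\b'
bad_parts = re.split(unescaped, 'QuIcK dOg', flags=re.IGNORECASE)
print(bad_parts)  # ['', 'QuIcK', None, ' ', 'dOg', None, '']

# With re.escape, the brackets are literal and only one group remains.
escaped = r'\b(' + '|'.join(map(re.escape, words)) + r')\b'
good_parts = re.split(escaped, 'QuIcK dOg', flags=re.IGNORECASE)
print(good_parts)  # ['', 'QuIcK', ' ', 'dOg', '']
```

Escaping alone removes the None entries, but the \b placed before a literal '(' would still stop '(copy)' from matching in most contexts, which is the second problem listed above.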

The following makes it work:

import re

def lowercase_words(string, words):
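    def make_regex_part(word):
        # Escape regex special characters such as '(' and ')'.
        word = re.escape(word)
        # Only require a word boundary where the word starts or ends with a letter.
        if word[:1].isalpha():
            word = r'\b' + word
        if word[-1:].isalpha():
            word += r'\b'
        return word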

    regex = '(' + '|'.join(map(make_regex_part, words)) + ')'
    parts = re.split(regex, string, flags=re.IGNORECASE)
    parts[1::2] = (word.lower() for word in parts[1::2])
    return ''.join(parts)

Example:

>>> lowercase_words('(TrY) iT nOw WiTh bRaCkEtS', ['(try)', 'it'])
'(try) it nOw WiTh bRaCkEtS'
kaya3
  • 1
    This answer is more complete, IMO, than the accepted answer. "Upvoting". – Ch3steR Feb 04 '20 at 17:58
  • Hello, thanks for the great response! However, I am getting an error: `parts[1::2] = (word.lower() for word in parts[1::2]) AttributeError: 'NoneType' object has no attribute 'lower'` I believe this might be due to having characters other than letters in the strings I use in `lower_list`. If I have a string like `'(copy)'` in the list, then I get the above error. Is there a way to get around this? I was thinking of converting every split part to a string using `str(xxx)`, but I am not sure how to do that... Thanks @kaya3 – Chandler Cree Feb 04 '20 at 21:10
  • @ChandlerCree That'll be because `(` and `)` are special characters in regexes. The solution will be to escape the special characters, but it's a bit trickier since `(` and `)` won't necessarily be on word boundaries. – kaya3 Feb 04 '20 at 21:38
  • 1
    (See edit.) That's assuming you only want to make `copy` lowercase when it's in brackets. Otherwise you can just strip the brackets from it before calling this function. – kaya3 Feb 04 '20 at 21:49
  • This works perfectly. Solved the issue and now the script is running exactly as needed. Best! @kaya3 – Chandler Cree Feb 04 '20 at 21:55
4

Try this:

string1 = "ThE QuIcK BroWn foX jUmpEd oVer thE LaZY dOg."
lower_list = ['quick', 'jumped', 'dog']

output = ' '.join(s.lower() if s.lower().strip(".") in lower_list else s for s in string1.split())
print(output)

Output:

ThE quick BroWn foX jumped oVer thE LaZY dog.

To preserve formatting, use this:

string1 = "ThE QuIcK BroWn foX jUmpEd oVer thE\n\tLaZY dOg."
lower_list = ['quick', 'jumped', 'dog']

output = ''
word = ''
for c in string1:
    if c in ("\n", "\t", " ", "."):
        if word.lower() in lower_list:
            word = word.lower()
        output += word + c
        word = ''
    else:
        word += c
output += word
print(output)

Output:

ThE quick BroWn foX jumped oVer thE
    LaZY dog.
Ed Ward
  • 2
    This removes newlines and tabs, doesn't it? – LeoE Feb 04 '20 at 17:13
  • umm yes it does... Do you need a version which doesn't? – Ed Ward Feb 04 '20 at 17:14
  • 1
    Well to cite OP: _My idea was to use a for loop, but I need to retain the formatting of the string for example say a string has multiple lines and indents on some lines and not others._ – LeoE Feb 04 '20 at 17:16
1

Using re.sub. Define a replacement function:

import re

lower_list = ['quick', 'jumped', 'dog']

def lower_selected_words(matchobj):
    word = matchobj.group(0)
    word_lower = word.lower()
    if word_lower in lower_list:
        return word_lower
    return word

And then replace:

re.sub(r'\b(\w+)\b', lower_selected_words, string1)
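
For reference, the same re.sub idea can be wrapped in one self-contained function so that the word list is passed in rather than read from a global. This is only a sketch: the name lowercase_listed_words and the use of a set for lookups are assumptions added here, not part of the answer above.

```python
import re

def lowercase_listed_words(text, words):
    """Lowercase only the whole words of `text` that appear in `words`."""
    wanted = {w.lower() for w in words}  # set for fast membership tests

    def replace(match):
        word = match.group(0)
        lowered = word.lower()
        # Return the lowercase form for listed words, otherwise leave as-is.
        return lowered if lowered in wanted else word

    # \b\w+\b matches each run of word characters; everything else is untouched.
    return re.sub(r'\b\w+\b', replace, text)

string1 = "ThE QuIcK BroWn foX jUmpEd oVer thE\n\tLaZY dOg."
result = lowercase_listed_words(string1, ['quick', 'jumped', 'dog'])
print(result)
```

Because re.sub only rewrites the matched words, the newline and tab in the sample string are left in place.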
Paulo Almeida
0

You can use a list comprehension like this, although it still loops over the words internally:

' '.join([i.lower() if i.lower().strip('.') in lower_list else i for i in string1.split()])
santo
0

Here's a simple solution to this:

string1 = "ThE QuIcK BroWn foX jUmpEd oVer thE LaZY dOg."
lower_list = ['quick', 'jumped', 'dog']
words = string1.split()

for i, word in enumerate(words):
    if word.lower().strip('.') in lower_list:
        words[i] = word.lower()

print(" ".join(words))

hrishikeshpaul
0

I used regex to solve this:

import re
string1 = "ThE QuIcK BroWn foX jUmpEd oVer thE LaZY dOg."

matches = re.finditer(r'\b(quick|jumped|brown)\b', string1, flags=re.IGNORECASE)

indices = [(m.start(0), m.end(0)) for m in matches]

broken_string = []
end_index = 0
for index in indices:
    broken_string.append(string1[end_index:index[0]])
    broken_string.append(string1[index[0]:index[1]].lower())
    end_index = index[1]

broken_string.append(string1[end_index:])

print("".join(broken_string))

Output:

ThE quick brown foX jumped oVer thE LaZY dOg.

The benefit of this method is that it preserves whitespace and won't match inside longer words: for example, matching quick will not alter quickly.
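
The whole-word behaviour comes from the \b anchors around the group. A short check of that claim, using a made-up two-word input:

```python
import re

pattern = re.compile(r'\b(quick)\b', flags=re.IGNORECASE)

# "QuIcK" is a whole word and is lowercased; "QuIcKlY" merely starts
# with it, so the trailing \b fails and the word is left alone.
checked = pattern.sub(lambda m: m.group(0).lower(), "QuIcK QuIcKlY")
print(checked)  # quick QuIcKlY
```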

stackErr