How to replace a pattern using regex in python?

Question

I have a dataset that looks like this:

Male    Name=Tony;  
Female  Name=Alice.1; 
Female  Name=Alice.2;
Male    Name=Ben; 
Male    Name=Shankar; 
Male    Name=Bala; 
Female  Name=Nina; 
###
Female  Name=Alex.1; 
Female  Name=Alex.2;
Male    Name=James; 
Male    Name=Graham; 
Female  Name=Smith;  
###
Female  Name=Xing;
Female  Name=Flora;
Male    Name=Steve.1;
Male    Name=Steve.2; 
Female  Name=Zac;  
###

I want to change the list so it looks like this:

Male    Name=Class_1;
Female  Name=Class_1.1;
Female  Name=Class_1.2;
Male    Name=Class_1;
Male    Name=Class_1;
Male    Name=Class_1; 
Female  Name=Class_1;
###
Female  Name=Class_2.1; 
Female  Name=Class_2.2; 
Male    Name=Class_2; 
Male    Name=Class_2; 
Female  Name=Class_2;  
###
Female  Name=Class_3; 
Female  Name=Class_3; 
Male    Name=Class_3.1; 
Male    Name=Class_3.2; 
Female  Name=Class_3;
###

Each name has to be changed to the class they belong to. I noticed that in the dataset, each new class in the list is denoted by a ‘###’. So I can split the data set into blocks by ‘###’ and count the instances of ###. Then use regex to look for the names, and replace them by the count of ###.

My code looks like this:

blocks = [b.strip() for b in open('/file', 'r').readlines()]
pattern = r'Name=(.*?)[;/]'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:
    match = re.findall(pattern, line)
    print match

for line in blocks:
    if line == '###':
        triple_hash_count += 1
        print line 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count)))

This doesn’t seem to do the job - no replacements are made.

Possible duplicate of [Python string.replace regular expression](http://stackoverflow.com/questions/16720541/python-string-replace-regular-expression) — Fermat's Little Student, Mar 25 '17 at 19:53
If you're actually using curly quotes, that's not valid Python syntax. Are you programming in Word or something? — jonrsharpe, Mar 25 '17 at 19:54
what does this mean? Oh no sorry, I copied my code from a text file to here. Silly — Python_Newbie_2, Mar 25 '17 at 19:55
Hi will - the answers on the posts you suggested are not particularly helpful for me — Python_Newbie_2, Mar 25 '17 at 19:59
My issue lies with the fact that no replacements are made to match (to the pattern I have given). — Python_Newbie_2, Mar 25 '17 at 20:05
`match` in `print(line.replace(match, prefix + str(triple_hash_count)))` is out of scope ... you have two separate for-loops happening. `match` is only valid in the first for-loop. — Matthew Cole, Mar 25 '17 at 20:05

score 1 · Accepted Answer · edited Mar 26 '17 at 02:22

When running the code you provided, I got the following traceback output:

print(line.replace(match, prefix + str(triple_hash_count))) 
TypeError: Can't convert 'list' object to str implicitly

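The error happens because type(match) evaluates to a list. When I inspect this list in PDB, it's an empty list. This is because match is only assigned in the first for-loop: by the time the second loop runs, it still holds the result for the last line of the file, which is ### and contains no name. So let's combine the two loops as such: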

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count)))

Now you're getting content in match, but there's still a problem: the return type of re.findall is a list of strings. str.replace(...) expects a single string as its first argument.
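To make the type mismatch concrete, here is a minimal sketch using one sample line from the question. The names `error_raised` and `no_match` are only illustrative.

```python
import re

pattern = r'Name=(.*?)[;/]'
line = 'Female  Name=Alice.1;'

# re.findall always returns a list, even when there is a single hit
match = re.findall(pattern, line)
print(type(match).__name__, match)

# Passing that list straight to str.replace raises a TypeError
try:
    line.replace(match, 'Class_1')
    error_raised = False
except TypeError:
    error_raised = True
print(error_raised)

# A separator line has no Name= field, so the list comes back empty
no_match = re.findall(pattern, '###')
print(no_match)
```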

You could cheat, and change the offending line to print(line.replace(match[0], prefix + str(triple_hash_count))) -- but that presumes that you're sure you're going to find a regular expression match on every line that isn't ###. A more resilient way would be to check to see that you have the match before you try to call str.replace() on it.

The final code looks like this:

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else:
        if match: 
            print(line.replace(match[0], prefix + str(triple_hash_count)))
        else:
            print(line)

Two more things:

On line 11, you mistook the variable name. It's triple_hash_count, not hash_count.
This code won't actually change the text file provided as input on line 1. You need to write the result of line.replace(match[0], prefix + str(triple_hash_count)) back to the file, not just print it.
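Writing the result back could be sketched roughly as follows. This is only one possible approach under some assumptions: the helper names `relabel_lines` and `relabel_file` are made up for illustration, the pattern is the one suggested in the comments, and the output goes to a separate path rather than overwriting the input. It also swaps str.replace for slicing on the captured span, so that a name which happens to appear elsewhere in the line (for instance inside "Male") is not touched.

```python
import re

PATTERN = re.compile(r'Name=([^.\d;]*)')  # pattern suggested in the comments
PREFIX = 'Class_'

def relabel_lines(lines):
    """Yield each line with its name replaced by the current class label."""
    class_number = 1
    for line in lines:
        stripped = line.strip()
        if stripped == '###':
            # A separator closes the current class; the next lines belong to the next one
            class_number += 1
            yield stripped
            continue
        found = PATTERN.search(stripped)
        if found and found.group(1):
            # Replace only the captured name, keeping any ".1" suffix and the ";"
            start, end = found.span(1)
            stripped = stripped[:start] + PREFIX + str(class_number) + stripped[end:]
        yield stripped

def relabel_file(src_path, dst_path):
    """Read src_path, relabel its lines, and write the result to dst_path."""
    with open(src_path) as src:
        new_lines = list(relabel_lines(src))
    with open(dst_path, 'w') as dst:
        dst.write('\n'.join(new_lines) + '\n')
```

A call such as relabel_file('/file', '/file.out') would then leave the original input untouched; both paths here are placeholders.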

Your answer is correct, but OP's regex needed a tweak to address the lines with the trailing '.1', '.2', etc — Paul Back, Mar 25 '17 at 20:41
@PaulBack I noticed that you put that change in your post. But I'd recommend `pattern = r'Name=([^\.\d;]*)'` so that it doesn't ingest the period between the name and the uniqueness counter. — Matthew Cole, Mar 25 '17 at 20:49

Paul Back · Answer 2 · 2017-03-25T20:54:41.110

1

The problem is rooted in the use of a second loop (as well as a mis-named variable). This will work.

import re

blocks = [b.strip() for b in open('/file', 'r').readlines()]
pattern = r'Name=([^\.\d;]*)'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:

    if line == '###':
        triple_hash_count += 1
        print(line)
    else:
        match = re.findall(pattern, line)
        print(line.replace(match[0], prefix + str(triple_hash_count)))

edited Mar 25 '17 at 20:54

answered Mar 25 '17 at 20:32


the `\d` and `*` in the regex `[^\.\d;]*` are not required! This: `r'=(.*?)[\.;]'` does all it needs – DahliaSR Mar 25 '17 at 21:42
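The patterns discussed in this thread can be compared side by side with a small sketch. The first two sample lines come from the question; the third, with the name R2D2, is a hypothetical case added only to show where the patterns diverge when a name itself contains digits.

```python
import re

patterns = {
    'question': r'Name=(.*?)[;/]',
    'negated class': r'Name=([^\.\d;]*)',
    'lazy up to . or ;': r'=(.*?)[\.;]',
}
samples = [
    'Male    Name=Tony;',
    'Female  Name=Alice.1;',
    'Male    Name=R2D2;',   # hypothetical: a name that contains digits
]

# First captured group for every pattern/sample combination
results = {
    label: [re.findall(rx, sample)[0] for sample in samples]
    for label, rx in patterns.items()
}
for label, captured in results.items():
    print(label, captured)
```

On the question's data the last two patterns capture the same text; they only differ on names with embedded digits, where the negated class stops at the first digit.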

Jan · Answer 3 · 2017-03-25T21:52:12.410

While you already have your answer, you can do it in just a couple of lines with regular expressions (it could even be a one-liner but this is not very readable):

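import re

# string holds the full text of the input (for example, the contents of the file)
hashrx = re.compile(r'^###$', re.MULTILINE)
namerx = re.compile(r'Name=\w+(\.\d+)?;')

# The pattern consumes the trailing ';', so the replacement puts it back
new_string = '###'.join([namerx.sub(r"Name=Class_{}\1;".format(idx + 1), part) 
                for idx,part in enumerate(hashrx.split(string))])
print(new_string)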

What it does:

First, it looks for ### in a single line with the anchors ^ and $ in MULTILINE mode.
Second, it looks for a possible number after the Name, capturing it into group 1 (but made optional as not all of your names have it).
Third, it splits your string by ### and iterates over it with enumerate(), thus having a counter for the numbers to be inserted.
Lastly, it joins the resulting list by ### again.
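The four steps above can be tried on a short in-memory sample, as in this sketch. The sample text and the helper name `relabel` are assumptions for illustration. The sketch uses a replacement function instead of a `\1` template, which sidesteps the unmatched-group case on older Python versions (to my knowledge, before 3.5 an unmatched group in a template raised an error), and it writes the `;` back because the pattern consumes it.

```python
import re

# Hypothetical sample input; in practice this would be the file's contents
string = """Male    Name=Tony;
Female  Name=Alice.1;
###
Female  Name=Alex.2;
Male    Name=James;
###
"""

hashrx = re.compile(r'^###$', re.MULTILINE)
namerx = re.compile(r'Name=\w+(\.\d+)?;')

def relabel(match, class_number):
    """Build the replacement text for one Name=...; field."""
    suffix = match.group(1) or ''   # group 1 is None when there is no ".N" part
    return 'Name=Class_{}{};'.format(class_number, suffix)

# Split into blocks on the ### lines, relabel each block, then join again
parts = hashrx.split(string)
new_string = '###'.join(
    namerx.sub(lambda m, n=idx + 1: relabel(m, n), part)
    for idx, part in enumerate(parts)
)
print(new_string)
```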

As a one-liner (though not advisable):

new_string = '###'.join(
                [re.sub(r'Name=\w+(\.\d+)?;', r"Name=Class_{}\1;".format(idx + 1), part) 
                for idx, part in enumerate(re.split(r'^###$', string, flags=re.MULTILINE))])

Demo

A demo says more than a thousand words.
