4

I have a dataset that looks like this:

Male    Name=Tony;  
Female  Name=Alice.1; 
Female  Name=Alice.2;
Male    Name=Ben; 
Male    Name=Shankar; 
Male    Name=Bala; 
Female  Name=Nina; 
###
Female  Name=Alex.1; 
Female  Name=Alex.2;
Male    Name=James; 
Male    Name=Graham; 
Female  Name=Smith;  
###
Female  Name=Xing;
Female  Name=Flora;
Male    Name=Steve.1;
Male    Name=Steve.2; 
Female  Name=Zac;  
###

I want to the change the list so it looks like this:

Male    Name=Class_1;
Female  Name=Class_1.1;
Female  Name=Class_1.2;
Male    Name=Class_1;
Male    Name=Class_1;
Male    Name=Class_1; 
Female  Name=Class_1;
###
Female  Name=Class_2.1; 
Female  Name=Class_2.2; 
Male    Name=Class_2; 
Male    Name=Class_2; 
Female  Name=Class_2;  
###
Female  Name=Class_3; 
Female  Name=Class_3; 
Male    Name=Class_3.1; 
Male    Name=Class_3.2; 
Female  Name=Class_3;
###

Each name has to be changed to the class they belong to. I noticed that in the dataset, each new class in the list is denoted by a ‘###’. So I can split the data set into blocks by ‘###’ and count the instances of ###. Then use regex to look for the names, and replace them by the count of ###.

My code looks like this:

blocks = [b.strip() for b in open('/file', 'r').readlines()]
pattern = r'Name=(.*?)[;/]'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:
    match = re.findall(pattern, line)
    print match

for line in blocks:
    if line == '###':
        triple_hash_count += 1
        print line 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count))) 

This doesn’t seem to do the job - no replacements are made.

Jan
  • 42,290
  • 8
  • 54
  • 79

3 Answers3

1

When running the code you provided, I got the following traceback output:

print(line.replace(match, prefix + str(triple_hash_count))) 
TypeError: Can't convert 'list' object to str implicitly

The error happens because type(match) evaluates to a list. When I inspect this list in PDB, it's an empty list. This is because match has gone out of scope by having two for-loops. So let's combine them as such:

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count)))

Now you're getting content in match, but there's still a problem: the return type of re.findall is a list of strings. str.replace(...) expects a single string as its first argument.

You could cheat, and change the offending line to print(line.replace(match[0], prefix + str(triple_hash_count))) -- but that presumes that you're sure you're going to find a regular expression match on every line that isn't ###. A more resilient way would be to check to see that you have the match before you try to call str.replace() on it.

The final code looks like this:

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else:
        if match: 
            print(line.replace(match[0], prefix + str(triple_hash_count)))
        else:
            print(line)

Two more things:

  1. On line 11, you mistook the variable name. It's triple_hash_count, not hash_count.
  2. This code won't actually change the text file provided as input on line 1. You need to write the result of line.replace(match, prefix + str(triple_hash_count)) back to the file, not just print it.
DahliaSR
  • 77
  • 6
Matthew Cole
  • 602
  • 5
  • 21
1

The problem is rooted in the use of a second loop (as well as a mis-named variable). This will work.

import re

blocks = [b.strip() for b in open('/file', 'r').readlines()]
pattern = r'Name=([^\.\d;]*)'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:

    if line == '###':
        triple_hash_count += 1
        print line     
    else:
        match = re.findall(pattern, line)
        print line.replace(match[0], prefix + str(triple_hash_count)) 
Paul Back
  • 1,269
  • 16
  • 23
  • the `\d` and `*` in the regex `[^\.\d;]*` are not required! This: `r'=(.*?)[\.;]'` does all it needs – DahliaSR Mar 25 '17 at 21:42
1

While you already have your answer, you can do it in just a couple of lines with regular expressions (it could even be a one-liner but this is not very readable):

import re
hashrx = re.compile(r'^###$', re.MULTILINE)
namerx = re.compile(r'Name=\w+(\.\d+)?;')

new_string = '###'.join([namerx.sub(r"Name=Class_{}\1".format(idx + 1), part) 
                for idx,part in enumerate(hashrx.split(string))])
print(new_string)

What it does:

  1. First, it looks for ### in a single line with the anchors ^ and $ in MULTILINE mode.
  2. Second, it looks for a possible number after the Name, capturing it into group 1 (but made optional as not all of your names have it).
  3. Third, it splits your string by ### and iterates over it with enumerate(), thus having a counter for the numbers to be inserted.
  4. Lastly, it joins the resulting list by ### again.

As a one-liner (though not advisable):

new_string = '###'.join(
                [re.sub(r'Name=\w+(\.\d+)?;', r"Name=Class_{}\1".format(idx + 1), part) 
                for idx, part in enumerate(re.split(r'^###$', string, flags=re.MULTILINE))])

Demo

A demo says more than thousands words.

Jan
  • 42,290
  • 8
  • 54
  • 79