I have a Python list in which each string takes one of the following four forms (the names vary, of course):

Mr: Smith\n
Mr: Smith; John\n
Smith\n
Smith; John\n

I want these to be corrected to:

Mr,Smith,fname\n
Mr,Smith,John\n
title,Smith,fname\n
title,Smith,John\n

This is easy enough to do with four re.sub() calls:

import re

with open("path/to/file", 'r') as fileset:
    dataset = fileset.readlines()
for item in dataset:
    dataset = [item.strip() for item in dataset]    #removes some misc. white noise
    item = re.sub(r'(.*):\W(.*);\W', r'\g<1>'+','+r'\g<2>'+',', item)
    item = re.sub(r'(.*);\W(.*)', 'title,'+r'\g<1>'+','+r'\g<2>', item)
    item = re.sub(r'(.*):\W(.*)', r'\g<1>'+','+r'\g<2>'+',fname', item)
    item = re.sub(r'(.*)', 'title,'+r'\g<1>'+',fname', item)

While this is fine for the dataset I'm using, I want to be more efficient.
Is there a single operation that can simplify this process?

Please pardon if I forgot a quote or some such; I'm not at my workstation now and I'm aware I've stripped the newline (\n).

Thank you,

t.r.orion
  • Compiling the regexes in advance helps because it can cache the state machines it generates. Backreferences can be slow as well. – Beefster Jan 05 '18 at 21:11
  • Question: what do you want to do *in fine* with your dataset? Do you want to write a new (corrected) file? – Casimir et Hippolyte Jan 05 '18 at 21:22
  • What's this logic with `for item in dataset:`? You have a double, nested loop. – Eric Duminil Jan 05 '18 at 21:23
  • @Beefster: The difference is often negligible with a compiled regex because the string regex gets compiled and stored in cache anyway. – Eric Duminil Jan 05 '18 at 21:24
  • In fine I will be inserting to a database but that is a couple of steps out. shorter term I will be using the information to get data from a mix of web pages. – t.r.orion Jan 05 '18 at 22:35

2 Answers

Brief

Instead of running two loops, you can reduce it to a single loop. Adapted from How to iterate over the file in Python (using the r and repl defined in the Code section below):

f = open("path/to/file", 'r')
while True:
    x = f.readline()
    if not x: break
    # strip the trailing newline so it cannot end up inside capture group 2
    print(re.sub(r, repl, x.rstrip("\r\n")))
f.close()

See Python - How to use regexp on file, line by line, in Python for other alternatives.
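
One such alternative, as a minimal sketch: iterate over the file object directly inside a with block, so the file is closed automatically and each line is read only once. The helper name normalize_lines is hypothetical, and io.StringIO stands in for the real file; the pattern and repl are the ones from the Code section below.

```python
import io
import re

# Same pattern as in the Code section, compiled once up front.
PATTERN = re.compile(r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?")

def repl(m):
    # Fall back to the placeholder words when a part is missing.
    return (m.group(1) or "title") + "," + m.group(2) + "," + (m.group(3) or "fname")

def normalize_lines(lines):
    # Hypothetical helper: accepts any iterable of lines (an open file works).
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield PATTERN.sub(repl, line)

# io.StringIO stands in for open("path/to/file") in this sketch.
with io.StringIO("Mr: Smith\nMr: Smith; John\nSmith\nSmith; John\n") as f:
    result = list(normalize_lines(f))

print(result)
```

Because normalize_lines is a generator, nothing forces the whole file into memory; the list() call here is only for display.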


Code

For ease of viewing, I've replaced your file with a list.

See regex in use here

^(?:([^:\r\n]+):\W*)?([^;\r\n]+)(?:;\W*(.+))?

Note: the \r\n exclusions are only needed to make the pattern work on regex101's multi-line sample. In Python, applied to one line at a time with the newline stripped, the regex is just ^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?

Usage

See code in use here

import re

a = [
    "Mr: Smith",
    "Mr: Smith; John",
    "Smith",
    "Smith; John"
]
r = r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?"

def repl(m):
    return (m.group(1) or "title" ) + "," + m.group(2) + "," + (m.group(3) or "fname")

for s in a:
    print(re.sub(r, repl, s))

Explanation

  • ^ Assert position at the start of the line
  • (?:([^:]+):\W*)? Optionally match the following
    • ([^:]+) Capture any character except : one or more times into capture group 1
    • : Match this literally
    • \W* Match any number of non-word characters (copied from OP's original code, I assume \s* can be used instead)
  • ([^;]+) Capture any character except ; one or more times into capture group 2
  • (?:;\W*(.+))? Optionally match the following
    • ; Match this literally
    • \W* Match any number of non-word characters (copied from OP's original code, I assume \s* can be used instead)
    • (.+) Capture any character one or more times into capture group 3

Given the above explanation of the regex, re.sub(r, repl, s) works as follows:

  • repl is a callback: re.sub calls it with each match object and substitutes the string it returns, which is built from:
    • group 1 if it captured anything, title otherwise
    • group 2 (it's supposedly always set - using OP's logic here again)
    • group 3 if it captured anything, fname otherwise
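
To see which groups are set for each of the four input shapes, a small sketch like the following can help. It only uses re.compile, match and groups() from the standard library; the variable names are illustrative.

```python
import re

pattern = re.compile(r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?")

samples = ["Mr: Smith", "Mr: Smith; John", "Smith", "Smith; John"]

# groups() reports None for optional groups that took no part in the match.
captured = {s: pattern.match(s).groups() for s in samples}

for s, groups in captured.items():
    print(repr(s), "->", groups)
```

An optional group that did not match shows up as None, which is why repl can use `or` to supply the title and fname placeholders.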
ctwheels
  • Yes, even if your suggestion reduces four patterns to only one (that will probably speed up the code), note that the main problem in his code are not the patterns but the redundant and nested loops. – Casimir et Hippolyte Jan 05 '18 at 21:39
  • I think he has a typo: there should have been `with`, instead of `while` on the line where he opens the file, and next line reads all lines from file with `readlines` – Igor Nikolaev Jan 05 '18 at 21:45
  • Good job, but without being an expert in Python good practices, I think that Mike Pennington's design (see your other alternatives link) is less old school than the one you suggested. Something like [this](https://tio.run/##Pco9CsAgDEDhvafIpoLUoVtv00LEgMQQU0pPb3@GfuPjyWWl8TLGSVagCbJ3sllJ1lKmii6CUxdg65DLOsEjN4VKjED8t5cosXnFuR@71wiKUuN3zt2UxIcwxg0) – Casimir et Hippolyte Jan 05 '18 at 21:55

IMHO, regexes are just too complex here; you can use classic string functions to split your item string into chunks. For that, you can use partition (or rpartition).

First, split your item string into "records", like this:

item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
records = item.splitlines()
# -> ['Mr: Smith', ' Mr: Smith; John', ' Smith', ' Smith; John']

Then, you can create a short function to normalize each "record". Here is an example:

def normalize_record(record):
    # type: (str) -> str
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    title = title.strip() or 'title'
    name = name.strip()
    fname = fname.strip() or 'fname'
    return "{0},{1},{2}".format(title, name, fname)

This function is easier to understand than a collection of regexes and, in most cases, it is faster.
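
Whether it is actually faster depends on the data and the Python version, so it is worth measuring. Here is a rough sketch with timeit that compares the partition-based function against a single-regex variant; the normalize_regex name and the sample list are made up for this comparison, and no timings are claimed here.

```python
import re
import timeit

PATTERN = re.compile(r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?")

def normalize_regex(record):
    # Single-regex variant, used here only as a baseline.
    m = PATTERN.match(record.strip())
    return "{0},{1},{2}".format(m.group(1) or "title", m.group(2), m.group(3) or "fname")

def normalize_record(record):
    # Same partition-based logic as above.
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    return "{0},{1},{2}".format(title.strip() or 'title', name.strip(), fname.strip() or 'fname')

records = ["Mr: Smith", "Mr: Smith; John", "Smith", "Smith; John"]

# Both variants should agree before timing them.
assert [normalize_regex(r) for r in records] == [normalize_record(r) for r in records]

for func in (normalize_regex, normalize_record):
    seconds = timeit.timeit(lambda: [func(r) for r in records], number=10000)
    print(func.__name__, round(seconds, 4))
```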

For a better integration, you can define another function to handle each item:

def normalize(row):
    records = row.splitlines()
    return "\n".join(normalize_record(record) for record in records) + "\n"

Demo:

item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
item = normalize(item)

You get:

'Mr,Smith,fname\nMr,Smith,John\ntitle,Smith,fname\ntitle,Smith,John\n'
Laurent LAPORTE
  • And regex patterns would be more complex than that? – Borodin Jan 09 '18 at 21:11
  • Obviously, have you taken a look at the accepted answer? – Laurent LAPORTE Jan 09 '18 at 21:52
  • Yes, although it's not laid out well and it's longer than necessary. I understand how regular expressions can be an anathema, but I spent ten years writing Python and another eight with Perl and its built-in regex engine. I would now far rather read the regex solution than your cryptic Python. – Borodin Jan 09 '18 at 22:14