String Replacement and Saving to a New File (Python v2.7)

Question

I am trying to replace all lines of a certain format with blanks in a file, i.e. replace a line of number/number/number (like a date) and number:number (like a time) with "". I want to read from the old file and then save the scrubbed version as a new file.

This is the code I have so far (I know it is way off):

old_file = open("old_text.txt", "r")
new_file = open("new_text.txt", "w")

print (old_file.read())

for line in old_file.readlines():
    cleaned_line = line.replace("%/%/%", "")
    cleaned_line = line.replace("%:%", "")
    new_file.write(cleaned_line)

old_file.close
new_file.close

Thank you for your help, Ben

change `old_file.close` to `old_file.close()` same for new_file — Foo Bar User, Oct 02 '13 at 00:12
[This question on the `with` command will be handy](http://stackoverflow.com/questions/9282967/how-to-open-a-file-using-the-open-with-statement) — , Oct 02 '13 at 00:14
You don't need `old_file.readlines():` in your for loop. You can just do `for line in old_file:` To be honest, before I read your code, I never even knew `readlines` even existed. — Shashank, Oct 02 '13 at 00:23
@ShashankGupta actually, in that code, `readlines` (or `for line in file`) won't do anything, as the `file.read()` call has seeked to the end of file. Meaning there is nothing to iterate over. — , Oct 02 '13 at 00:36
@ShashankGupta: Half the tutorials out there teach people to use `readlines`. And I have no idea why. If it were up to me, `readlines` without a `hint` argument would have been scrapped in 3.x, instead of just making the note about it being unnecessary slightly more prevalent in the file object docs (which nobody knows how to find in 3.x anyway). — abarnert, Oct 02 '13 at 00:37
@LegoStormtroopr: Well, it still creates an explicit empty list to iterate over… But yeah, not exactly the #1 problem until you fix the other half-dozen. — abarnert, Oct 02 '13 at 00:37

abarnert · Answer 1 · 2013-10-02T00:40:48.683

I am trying to replace all lines of a certain format with blanks in a file, i.e. replace a line of number/number/number (like a date) and number:number (like a time) with "".

You can't use str.replace to match a pattern or format, only a literal string.
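A minimal sketch of the difference, using a made-up sample line (the text and variable names are only for illustration):

```python
import re

line = "12/25/2013 14:30 Merry Christmas\n"

# str.replace searches for the exact characters "%/%/%".
# They never appear in the line, so it comes back unchanged.
unchanged = line.replace("%/%/%", "")

# re.sub treats its first argument as a pattern:
# \d+ means "one or more digits".
without_date = re.sub(r"\d+/\d+/\d+", "", line)
```

Here `unchanged` is identical to `line`, while `without_date` has lost the date but still contains the time.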

To match a pattern, you need some kind of parser. For patterns like this, the regular expression engine built into the standard library as re is more than powerful enough… but you will need to learn how to write regular expressions for your patterns. The reference docs and Regular Expression HOWTO are great if you already know the basics; if not, you should search for a tutorial elsewhere.

Anyway, here's how you'd do this (fixing a few other things along the way, most of them explained by Lego Stormtroopr):

import re

with open("old_text.txt") as old_file, open("new_text.txt", "w") as new_file:
    for line in old_file:
        cleaned_line = re.sub(r'\d+/\d+/\d+', '', line)
        cleaned_line = re.sub(r'\d+:\d+', '', cleaned_line)
        new_file.write(cleaned_line)

Also, note that I used cleaned_line in the second sub; just using line again, as in your original code, means we lose the results of the first substitution.

Without knowing the exact definition of your problem, I can't promise that this does exactly what you want. Do you want to blank all lines that contain the pattern number/number/number, blank out all lines that are nothing but that pattern, blank out just that pattern and leave the rest of the line alone? All of those things are doable, and pretty easy, with re, but they're all done a little differently.
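To make the three readings concrete, here is a rough sketch with one function per interpretation. The function names are hypothetical, and I am assuming "blank" means leaving an empty line ("\n") behind rather than dropping the line entirely:

```python
import re

# Either a date-like or a time-like run of digits.
DATE_OR_TIME = re.compile(r"\d+/\d+/\d+|\d+:\d+")

# A line that is nothing but one such pattern, ignoring surrounding whitespace.
ONLY_DATE_OR_TIME = re.compile(r"\s*(?:\d+/\d+/\d+|\d+:\d+)\s*$")


def blank_if_contains(line):
    """Blank the whole line if the pattern appears anywhere in it."""
    return "\n" if DATE_OR_TIME.search(line) else line


def blank_if_only_pattern(line):
    """Blank the line only if it consists of the pattern and nothing else."""
    return "\n" if ONLY_DATE_OR_TIME.match(line) else line


def strip_pattern(line):
    """Remove just the matching text and keep the rest of the line."""
    return DATE_OR_TIME.sub("", line)
```

Any of these can replace the two re.sub calls inside the loop, e.g. `new_file.write(blank_if_contains(line))`.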

If you want to get a little trickier, you can use a single re.sub expression to replace all of the matching lines with blank lines at once, instead of iterating them one at a time. That means a slightly more complicated regexp vs. slightly simpler Python code, and it means probably better performance for mid-sized files but worse performance (and an upper limit) for huge files, and so on. If you can't figure out how to write the appropriate expression yourself, and there's no performance bottleneck to fix, I'd stick with explicit looping.
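A sketch of the single-expression approach, assuming the goal is to blank every line that contains either pattern and that the whole file fits comfortably in memory (the names are made up for illustration):

```python
import re

# With re.MULTILINE, ^ and $ match at the start and end of every line.
# "." never matches a newline here, so each match stays within one line.
LINE_WITH_DATE_OR_TIME = re.compile(
    r"^.*(?:\d+/\d+/\d+|\d+:\d+).*$", re.MULTILINE
)


def blank_matching_lines(text):
    """Replace each matching line with an empty line, keeping its newline."""
    return LINE_WITH_DATE_OR_TIME.sub("", text)


sample = "keep me\n10/02/2013 00:40\nkeep me too\nat 3:15 sharp\n"
cleaned = blank_matching_lines(sample)
```

With real files this would be something like `new_file.write(blank_matching_lines(old_file.read()))`, which is where the memory limit for huge files comes from.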

I am looking to blank out all lines that contain strings of the format number/number/number or number:number completely. I believe I can work out what to do from your and Lego Stormtroopr's answers. Thanks for the quick replies. — Ben, Oct 02 '13 at 03:48

score 0 · Accepted Answer · 2013-10-02T00:31:32.757

Firstly, there are some indentation issues: the for loop was indented for no reason. Secondly, as soon as you call read() on the file, the file position is at the end, so there are no more lines left to read. Lastly, the with statement lets you open a file and bind it to a name, and it closes the file for you whether the block finishes normally or with an error, so you don't have to close it manually.
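To see the second point in isolation, here is a small sketch that uses an in-memory io.StringIO object as a stand-in for the real file (an assumption, so that nothing has to exist on disk). Real file objects behave the same way:

```python
import io

old_file = io.StringIO(u"line one\nline two\n")

whole_text = old_file.read()       # consumes everything up to the end
leftover = old_file.readlines()    # nothing left, so this is an empty list

old_file.seek(0)                   # rewind to the start
lines_after_rewind = old_file.readlines()
```

`leftover` is empty because read() already moved the position to the end; after seek(0) both lines are available again.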

To perform the actual logic, however, you probably want to use a regular expression. You can use re.search() to find these patterns:

\d+:\d+ for one or more digits, a colon, and one or more digits
\d+/\d+/\d+ for three groups of one or more digits, with a literal / between them.

The code you want is closer to this:

import re

with open("old_text.txt", "r") as old_file, open("new_text.txt", "w") as new_file:
    for line in old_file:
        # Blank the line if a time (digits:digits) appears anywhere in it
        if re.search(r"\d+:\d+", line) is not None:
            line = ""
        # Blank the line if a date (digits/digits/digits) appears anywhere in it
        if re.search(r"\d+/\d+/\d+", line) is not None:
            line = ""
        new_file.write(line)

If you only want to match at the beginning of the line, re.match() will probably be a better choice.
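A quick sketch of the difference, using made-up sample strings:

```python
import re

TIME = r"\d+:\d+"

# re.search scans the whole string for the pattern.
found_anywhere = re.search(TIME, "Meeting moved to 14:30")

# re.match only tries at the very start of the string, so this is None.
found_at_start = re.match(TIME, "Meeting moved to 14:30")

# The same pattern matches once the time leads the line.
leading_time = re.match(TIME, "14:30 Meeting moved")
```

Only `found_at_start` is None; the other two are match objects whose group() is "14:30".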

Here we declare a block with our two files, loop through old_file, clean each line, and write it to new_file. When the block ends, both files are closed cleanly. If opening the second file fails, or an error occurs inside the block, the with statement does not catch the exception, but it still closes whatever was already open before the error propagates.
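The cleanup behaviour can be checked with a small sketch. The scratch path and the simulated error are made up for illustration:

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

try:
    with open(path, "w") as demo_file:
        demo_file.write("first line\n")
        raise ValueError("simulated failure mid-write")
except ValueError:
    # The exception still reaches us; with did not swallow it.
    pass

# The file was closed on the way out of the with block anyway.
was_closed = demo_file.closed
```

The exception passes through the with block unchanged, and `was_closed` is True because the file is closed on exit from the block.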

edited Oct 02 '13 at 00:31

answered Oct 02 '13 at 00:16

The `readlines()` isn't adding anything here except for performance problems; just do `for line in old_file:`. – abarnert Oct 02 '13 at 00:21
More importantly, I don't think this actually solves the OP's problem. From his description, he's hoping to match all lines with the format number/number/number, not all lines with the literal string `%/%/%`, and that's the part he doesn't know how to do. – abarnert Oct 02 '13 at 00:22
@abarnert Updated to include the number matching logic. – Oct 02 '13 at 00:31
*Flat is better than nested.* *Sparse is better than dense.* Just because the `with` statement lets you do things in one line doesn't mean you should recommend using it. – Shashank Oct 02 '13 at 00:32
@ShashankGupta: Using a `with` for a pair of matching input and output files is a pretty common and accepted idiom in Python. In fact, it's the main rationale for adding the multi-context `with` statement to the language. – abarnert Oct 02 '13 at 00:33
@ShashankGupta I'm doing things **`with`** these two files simultaneously. Why shouldn't they be together? – Oct 02 '13 at 00:34
@LegoStormtroopr: That's a very nice way to put it intuitively, which spares you having to think through the gritty details of "what do I want to happen if there's an exception here, or here, or here, or here…" in the most common cases. I'll have to remember that. – abarnert Oct 02 '13 at 18:18
