
Given an input file data.dat like this one:

# Some comment
# more comments
#
45.78
# aaa
0.056
0.67
# aaa
345.
0.78
99.
2.34
# aaa
65.7
0.9

I need to add a different comment above each line that starts with "# aaa", so it looks like this:

# Some comment
# more comments
#
45.78
# cmmt1
# aaa
0.056
0.67
# another cmmt
# aaa
345.
0.78
99.
2.34
# last one
# aaa
65.7
0.9

I know a priori the number of "# aaa" comments present in the data.dat file, but not their positions.

I have a way to do it (see code below) but it is quite complicated and not at all efficient. I need to apply this code to hundreds of large files, so I'm looking for an efficient way to do this.


# Read file
with open("data.dat", mode="r") as f:
    data = f.readlines()

# Indexes of "# aaa" comments
idx = []
for i, line in enumerate(data):
    if line.startswith("# aaa"):
        idx.append(i)

# Insert new comments in their proper positions
add_data = ["# cmmt1\n", "# another cmmt\n", "# last one\n"]
for i, j in enumerate(idx):
    data.insert(j + i, add_data[i])

# Write final data to file
with open("data_final.dat", mode="w") as f:
    for item in data:
        f.write("{}".format(item))
Gabriel

3 Answers


I didn't do any benchmarks, but re.sub could be faster - just load the text file as a whole, run re.sub, and write the result out:

data = '''# Some comment
# more comments
#
45.78
# aaa
0.056
0.67
# aaa
345.
0.78
99.
2.34
# aaa
65.7
0.9'''

import re

def fn():
    add_data = ["# cmmt1\n", "# another cmmt\n", "# last one\n"]
    for d in add_data:
        yield d

out = re.sub(r'^# aaa', lambda r, f=fn(): next(f) + r.group(0), data, flags=re.MULTILINE)
print(out)

Prints:

# Some comment
# more comments
#
45.78
# cmmt1
# aaa
0.056
0.67
# another cmmt
# aaa
345.
0.78
99.
2.34
# last one
# aaa
65.7
0.9

With file input/output:

import re

def fn():
    add_data = ["# cmmt1\n", "# another cmmt\n", "# last one\n"]
    for d in add_data:
        yield d

with open('data.dat', 'r') as f_in, \
    open('data.out', 'w') as f_out:
    f_out.write(re.sub(r'^# aaa', lambda r, f=fn(): next(f) + r.group(0), f_in.read(), flags=re.MULTILINE))

Version 2 (the "# aaa" line is pre-appended to each comment, so the replacement doesn't need the match object at all):

import re

def fn():
    add_data = ["# cmmt1\n", "# another cmmt\n", "# last one\n"]
    add_data = [s + '# aaa' for s in add_data]
    for d in add_data:
        yield d

with open('data.dat', 'r') as f_in, \
    open('data.out', 'w') as f_out:
    f_out.write(re.sub(r'^# aaa', lambda r, f=fn(): next(f), f_in.read(), flags=re.MULTILINE))
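
Since the same substitution runs on hundreds of files, compiling the pattern once with re.compile and looping over the files is a natural extension of Version 2. A rough, unbenchmarked sketch (the *.dat glob and the .out suffix are assumptions, not from the question):

import glob
import re

pattern = re.compile(r'^# aaa', flags=re.MULTILINE)

def fn():
    add_data = ["# cmmt1\n", "# another cmmt\n", "# last one\n"]
    # Same precomputation as Version 2: append the matched line up front
    for d in (s + '# aaa' for s in add_data):
        yield d

for name in glob.glob('*.dat'):
    # A fresh generator per file so the inserted comments restart each time
    with open(name) as f_in, open(name + '.out', 'w') as f_out:
        f_out.write(pattern.sub(lambda r, f=fn(): next(f), f_in.read()))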
Andrej Kesely
  • Reading the file as a string with `data = f.read()` (to match your input of `data` as a string) I'm getting the error: `TypeError: '_sre.SRE_Match' object has no attribute '__getitem__'` – Gabriel May 29 '19 at 13:58
  • @Gabriel I updated my answer for file i/o. The file `data.dat` contains the data as in previous example. – Andrej Kesely May 29 '19 at 14:02
  • Still the same error under Python 2.7.16, but not under 3.7.2; I guess something changed in the `re` module. – Gabriel May 29 '19 at 14:09
  • @Gabriel Yes, could be. I'm using Python 3.7. – Andrej Kesely May 29 '19 at 14:10
  • I did a simple benchmark and this method is almost 20% faster than mine. Too bad it does not work with v2.7, but anyway thank you! – Gabriel May 29 '19 at 14:12
  • @Gabriel For Python 2.7 just replace `r[0]` with `r.group(0)`. It should work. I updated my example. – Andrej Kesely May 29 '19 at 14:18
  • It does work (thank you), but under v2.7 this is actually a bit slower than my original method according to my benchmarks. It is still almost 20% faster under v3.x. – Gabriel May 29 '19 at 14:31
  • @Gabriel How big is your data? Maybe the string concatenation is slowing things down. Try this instead of the `+` string operator: `"{}{}".format(next(f), r.group(0))` – Andrej Kesely May 29 '19 at 14:33
  • My actual data is about 10 MB per file. This change does not bring much of an improvement in either version. – Gabriel May 29 '19 at 14:54
  • @Gabriel: OK, 10 MB isn't too much; it fits in memory with ease. Another attempt: we can get rid of `r.group(0)`. Change the lambda to `next(f) + '# aaa'`, because we know the string in advance. – Andrej Kesely May 29 '19 at 15:23
  • @Gabriel Or better, try version 2, I updated my answer - in this version I precompute data that are added to output file. – Andrej Kesely May 29 '19 at 15:30
  • Version 2 is actually quite a bit slower than the old one, in both v2.7 and v3.x. – Gabriel May 29 '19 at 16:57

According to Jan-Philip Gehrcke's response here, you should reduce the number of write calls.

To do so, you could maybe simply change:

with open("data_final.dat", mode="w") as f:
    for item in data:
        f.write("{}".format(item))

to:

with open("data_final.dat", mode="w") as f:
    f.write("".join(data))
H4kim
  • This simple change works in v2.7 and v3.x and improves the performance as much as Andrej's answer. Thank you! – Gabriel May 29 '19 at 14:15
  • Correction: this is the best performer under v2.7, but not under v3.x (Andrej's is) – Gabriel May 29 '19 at 14:32

When I need to change data in a text file, I try to read with one handle and immediately write with a second one.

def add_comments(input_file_name, output_file_name, list_of_comments):
    comments = iter(list_of_comments)  # or itertools.cycle(list_of_comments)
    with open(input_file_name) as fin, open(output_file_name, 'w') as fout:
        for line in fin:
            if line.startswith("# aaa"):
                fout.write(next(comments))
            fout.write(line)

For your example code, it would be called as:

add_comments("data.dat", "final_data.dat", ["# cmmt1\n", "# another cmmt\n", "# last one\n"])
Serge Ballesta
  • This improves the performance as much as H4kim's method, about 6% in Python v3.x, but it runs about 0.5% *slower* than the original method under Python 2.7 (as does Andrej's method). Thanks! – Gabriel May 29 '19 at 14:25
  • @Gabriel: Andrej traded (a lot of) memory for (a little) speed. Remember that if you intend to process huge files (close to or greater than available memory). – Serge Ballesta May 29 '19 at 14:46