regular expression in Python to update string in a file

Question

Anything that starts with <a class=“rms-req-link” href=“https://rms. AND ends with </a> should be replaced by TBD.

Example:

<a class=“req-link” href=“https://doc.test.com/req_view/ABC-3456">ABC-3456</a>

or:

<a class=“req-link” href=“https://doc.test.com/req_view/ABC-1234">ABC-1234</a>

Such strings should be replaced by TBD in the file.

Code I tried:

import re

output = open("regex1.txt","w")
input = open("regex.txt")

for line in input:
    output.write(re.sub(r"^<a class=“req-link” .*=“https://([a-zA-Z]+(\.[a-zA-Z]+)+).*</a>$", 'TBD', line))

input.close()
output.close()

Please take the time to properly format the code in your question. — Daniil Fajnberg, Dec 01 '22 at 15:40
Obligatory: https://stackoverflow.com/a/1732454/2221001 And while that is more-or-less a joke answer, it likely fits here. It may make more sense to use a module that is purpose built to parse HTML to perform this task. — JNevill, Dec 01 '22 at 15:43
Also: Obligatory mention of [regex101.com](https://regex101.com/) — Daniil Fajnberg, Dec 01 '22 at 15:44
As for the specifics of your question, you say in the first line that the pattern is that it starts with ` — JNevill, Dec 01 '22 at 15:44
You are doing `$` but you're going over lines of a file, so it's probably `\n$` — Tomerikoo, Dec 01 '22 at 15:48

score 0 · Answer 1 · answered Dec 01 '22 at 16:04

As mentioned in the comments, the pattern you mention does not match the one you use in your code, nor does it correspond to the example strings you want replaced. So you may or may not want to adjust the following pattern depending on what you actually need.

import re
from pathlib import Path


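# Lazily match from the opening tag up to the first closing </a>.
PATTERN = re.compile(r'<a\s+class=“req-link”\s+href=“https://.*?</a>')


def replace_a_tags(input_file: str, output_file: str) -> None:
    # Read the whole file at once, then substitute every match in one pass.
    contents = Path(input_file).read_text()
    with Path(output_file).open("w") as f:
        f.write(PATTERN.sub("TBD", contents))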


if __name__ == "__main__":
    replace_a_tags("input.txt", "output.txt")

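The .*? is important because it matches lazily (as opposed to greedily): it matches any character (.) zero or more times, but as few times as possible, so each match stops at the first closing anchor tag instead of running on to the last one on the line. Note that . does not match a newline by default, so an anchor tag that spans several lines is left untouched unless you compile the pattern with re.DOTALL.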
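To make the lazy versus greedy difference concrete, here is a minimal sketch. The sample line and its URLs are made up for illustration, and it uses a simplified pattern with straight quotes rather than the exact one above.

```python
import re

# Hypothetical sample line containing two anchor tags.
line = '<a href="https://x.test/1">A-1</a> and <a href="https://x.test/2">A-2</a>'

# Greedy: .* runs to the LAST </a> on the line, swallowing everything between.
greedy = re.sub(r'<a\s+href="https://.*</a>', "TBD", line)

# Lazy: .*? stops at the FIRST </a>, so each tag is replaced separately.
lazy = re.sub(r'<a\s+href="https://.*?</a>', "TBD", line)

print(greedy)  # TBD
print(lazy)    # TBD and TBD
```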

The pattern matches both your example strings.
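A quick way to check this yourself is to run the pattern over the two strings from the question. This is only a sketch: the strings are copied as they appear in the question, curly quotes included.

```python
import re

# Pattern copied from the answer, including the curly quotes that
# appear in the question's examples.
PATTERN = re.compile(r'<a\s+class=“req-link”\s+href=“https://.*?</a>')

examples = [
    '<a class=“req-link” href=“https://doc.test.com/req_view/ABC-3456">ABC-3456</a>',
    '<a class=“req-link” href=“https://doc.test.com/req_view/ABC-1234">ABC-1234</a>',
]

# Each example consists of nothing but the tag, so the result is just "TBD".
results = [PATTERN.sub("TBD", text) for text in examples]
print(results)  # ['TBD', 'TBD']
```

If your real file uses straight quotes rather than the curly ones shown in the question, the quote characters in the pattern need to change accordingly.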

The Path.read_text method reads the entire file into memory, which could be a problem if the file is gigantic, but I doubt that applies here. The benefit is that a single substitution over the whole text is typically faster than calling re.sub on each line individually.
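If the file did turn out to be too large to read in one go, a line-by-line variant could look roughly like the sketch below. The function name replace_a_tags_streaming and the explicit UTF-8 encoding are my own assumptions, and it only works if every anchor tag sits entirely on one line.

```python
import re
from pathlib import Path

# Same pattern as in the answer (curly quotes as in the question's examples).
PATTERN = re.compile(r'<a\s+class=“req-link”\s+href=“https://.*?</a>')


def replace_a_tags_streaming(input_file: str, output_file: str) -> None:
    # Hypothetical helper: handle one line at a time so memory use stays flat.
    # Assumes each <a>...</a> element does not span multiple lines.
    with Path(input_file).open(encoding="utf-8") as src, \
            Path(output_file).open("w", encoding="utf-8") as dst:
        for line in src:
            # Unmatched lines are written back unchanged, newline included.
            dst.write(PATTERN.sub("TBD", line))
```

Because the pattern has no ^ or $ anchors, the trailing newline on each line does not get in the way of the match.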
