Anything that starts with <a class=“rms-req-link” href=“https://rms. AND ends with </a> should be replaced by TBD.

Example:

<a class=“req-link” href=“https://doc.test.com/req_view/ABC-3456">ABC-3456</a> 

or:

<a class=“req-link” href=“https://doc.test.com/req_view/ABC-1234">ABC-1234</a>

Such strings should be replaced by TBD in the file.

Code I tried:

import re

output = open("regex1.txt","w")
input = open("regex.txt")

for line in input:
    output.write(re.sub(r"^<a class=“req-link” .*=“https://([a-zA-Z]+(\.[a-zA-Z]+)+).*</a>$", 'TBD', line))

input.close()
output.close()
JNevill

1 Answer

As mentioned in the comments, the pattern you describe (class rms-req-link, URL starting with https://rms.) does not match the one in your code, nor does it correspond to the example strings you want replaced (class req-link, URL starting with https://doc.test.com/). The pattern below is written against the example strings, so adjust it to whatever you actually need.

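import re
from pathlib import Path


# Matches one anchor tag with class “req-link” and an https URL.
# The curly quotes are copied from the example strings on purpose.
PATTERN = re.compile(r'<a\s+class=“req-link”\s+href=“https://.*?</a>')


def replace_a_tags(input_file: str, output_file: str) -> None:
    # Read the whole file, then replace every match in a single pass.
    contents = Path(input_file).read_text()
    with Path(output_file).open("w") as f:
        f.write(PATTERN.sub("TBD", contents))


if __name__ == "__main__":
    replace_a_tags("input.txt", "output.txt")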

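The .*? is important: it matches any character except a newline (.) zero or more times, but lazily (as opposed to greedily), i.e. as few times as possible, so the match stops at the first closing anchor tag. A greedy .* would instead run to the last </a> on the line and swallow everything between two tags.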
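To make the lazy/greedy difference concrete, here is a small sketch that puts both of your example strings on a single line. The names LAZY, GREEDY, and text are made up for illustration; only the lazy pattern is the one used in this answer.

```python
import re

# Two anchor tags on one line, copied from the question's examples.
# The curly quotes are part of the data, so they appear in the patterns too.
text = (
    '<a class=“req-link” href=“https://doc.test.com/req_view/ABC-3456">ABC-3456</a>'
    ' and '
    '<a class=“req-link” href=“https://doc.test.com/req_view/ABC-1234">ABC-1234</a>'
)

# Illustrative names: LAZY is the pattern from this answer, GREEDY differs
# only in using .* instead of .*?
LAZY = re.compile(r'<a\s+class=“req-link”\s+href=“https://.*?</a>')
GREEDY = re.compile(r'<a\s+class=“req-link”\s+href=“https://.*</a>')

lazy_result = LAZY.sub("TBD", text)      # each tag is replaced separately
greedy_result = GREEDY.sub("TBD", text)  # one match spans both tags

print(lazy_result)    # TBD and TBD
print(greedy_result)  # TBD
```

With the greedy version, the " and " between the two tags is lost, which is probably not what you want.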

The pattern matches both your example strings.

Path.read_text reads the entire file into memory, which could be a problem if the file is gigantic, but I doubt it is. The benefit is that a single substitution over the whole contents is usually faster than iterating over each line in the file individually.
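If the file ever did turn out to be too large to read at once, a line-by-line variant is one option. This is only a minimal sketch under a few assumptions: the function name replace_a_tags_streaming is hypothetical, the files are assumed to be UTF-8, and no anchor tag may be split across a line break. The whole-file version has that last limitation too, since . does not match a newline unless re.DOTALL is set.

```python
import re

PATTERN = re.compile(r'<a\s+class=“req-link”\s+href=“https://.*?</a>')


def replace_a_tags_streaming(input_file: str, output_file: str) -> None:
    # Hypothetical helper: handles one line at a time, so memory use is
    # bounded by the longest line rather than by the file size.
    # Assumption: UTF-8 files, and no anchor tag spans a line break.
    with open(input_file, encoding="utf-8") as src, \
            open(output_file, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(PATTERN.sub("TBD", line))
```

Unlike the ^...$ pattern in the question, this one is not anchored, so a tag is replaced even when other text sits on the same line.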

Daniil Fajnberg