-1

I have a html file with href hyperlinks to certain files. I want to find all the href links in this file and replace the links with the absolute paths to the file.

<BR><CENTER><TABLE BORDER=1 CELLPADDING=0 NOSAVE><TR ALIGN=CENTER NOSAVE><TD COLSPAN="4" NOSAVE><CENTER><B>Summary Table</B></CENTER><TR><TD>Testname</TD><TD>Status</TD><TD>Link to HTML</TD><TD>Utility</TD></TR><TR><TD>test1</TD><TD>FAIL</TD><TD><A HREF= abc.html>HTML_report</a></BR></TD><TD>run</TD></TR><TR><TD>31Jan2017_03h12m52s</TD><TD>FAIL</TD><TD><A HREF=def.html>HTML_report</a></BR></TD><TD>run_2</TD></TR></TABLE></CENTER><BR>

After replacing, this should look like -

<BR><CENTER><TABLE BORDER=1 CELLPADDING=0 NOSAVE><TR ALIGN=CENTER NOSAVE><TD COLSPAN="4" NOSAVE><CENTER><B>Summary Table</B></CENTER><TR><TD>Testname</TD><TD>Status</TD><TD>Link to HTML</TD><TD>Utility</TD></TR><TR><TD>test1</TD><TD>FAIL</TD><TD><a href=common?htmlview=1&file="absolute_path to abc.html">HTML_report</a></BR></TD><TD>run</TD></TR><TR><TD>31Jan2017_03h12m52s</TD><TD>FAIL</TD><TD><a href=common?htmlview=1&file="absolute_path to def.html">HTML_report</a></BR></TD><TD>run_2</TD></TR></TABLE></CENTER><BR>

I am reading the html file line by line, so a single line can have more than one occurrence of a href. I have tried using re.sub to find and place -

re.sub(r'\sA\sHREF\s','a href=common?htmlview=1&file=<path>',line) 
Mohammad Yusuf
  • 16,554
  • 10
  • 50
  • 78
NSU
  • 37
  • 1
  • 7

1 Answers1

0

Don't try to use regex to manipulate html, using lib such as BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup('<BR><CENTER><TABLE BORDER=1 CELLPADDING=0 NOSAVE><TR ALIGN=CENTER NOSAVE><TD COLSPAN="4" NOSAVE><CENTER><B>Summary Table</B></CENTER><TR><TD>Testname</TD><TD>Status</TD><TD>Link to HTML
</TD><TD>Utility</TD></TR><TR><TD>test1</TD><TD>FAIL</TD><TD><A HREF= abc.html>HTML_report</a></BR></TD><TD>run</TD></TR><TR><TD>31Jan2017_03h12m52s</TD><TD>FAIL</TD><TD><A HREF=def.html>HTML_report</a></BR></TD
><TD>run_2</TD></TR></TABLE></CENTER><BR>')

for link in soup.find_all('a'):
    link['href'] = 'fix the absolute path here %s' % (link.get('href'),)

print soup.prettify()
armnotstrong
  • 8,605
  • 16
  • 65
  • 130