I'm trying to replace all instances of href="../directory" with href="../directory/index.html".

In Python, this

reg = re.compile(r'<a href="../(.*?)">')
for match in re.findall(reg, input_html):
    output_html = input_html.replace(match, match+'index.html')

produces the following output:

href="../personal-autonomy/index.htmlindex.htmlindex.htmlindex.html"  
href="../paternalism/index.html"  
href="../principle-beneficence/index.htmlindex.htmlindex.html"  
href="../decision-capacity/index.htmlindex.htmlindex.html" 

Any idea why it works with the second link, but not with the others?

Relevant part of the source:

<p> 

 <a href="../personal-autonomy/">autonomy: personal</a> |
 <a href="../principle-beneficence/">beneficence, principle of</a> |
 <a href="../decision-capacity/">decision-making capacity</a> |
 <a href="../legal-obligation/">legal obligation and authority</a> |
 <a href="../paternalism/">paternalism</a> |
 <a href="../identity-personal/">personal identity</a> |
 <a href="../identity-ethics/">personal identity: and ethics</a> |
 <a href="../respect/">respect</a> |
 <a href="../well-being/">well-being</a> 

</p> 

EDIT: The repeated 'index.html' is actually the result of multiple matches. (e.g. href="../personal-autonomy/index.htmlindex.htmlindex.htmlindex.html" is because ../personal-autonomy is found four times in the original source).

As a general regex question, how would you replace all instances without adding an additional 'index.html' to all matches?

cyrus
  • Could you show us what the input is as well please? – Rodrigue Jan 27 '11 at 12:54
  • Why are you trying to parse HTML with regex? There are plenty of powerful parsers that could easily extract these statements by reading the DOM. Regex was not designed for HTML. – wheaties Jan 27 '11 at 13:27
  • A solution, of sorts: running .splitlines() on the source HTML, and then running the regex on each line, produced the desired result. However, I'm still not sure why it didn't work without splitting. – cyrus Jan 27 '11 at 13:51

5 Answers

Don't parse HTML with regexes:

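import re
from lxml import html

def replace_link(link):
    # Only touch links of the form ../directory/
    if re.match(r"\.\./[^/]+/$", link):
        link += "index.html"
    return link

# rewrite_links calls replace_link on every link in the document
print(html.rewrite_links(your_html_text, replace_link))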

Output

<p> 

 <a href="../personal-autonomy/index.html">autonomy: personal</a> |
 <a href="../principle-beneficence/index.html">beneficence, principle of</a> |
 <a href="../decision-capacity/index.html">decision-making capacity</a> |
 <a href="../legal-obligation/index.html">legal obligation and authority</a> |
 <a href="../paternalism/index.html">paternalism</a> |
 <a href="../identity-personal/index.html">personal identity</a> |
 <a href="../identity-ethics/index.html">personal identity: and ethics</a> |
 <a href="../respect/index.html">respect</a> |
 <a href="../well-being/index.html">well-being</a> 

</p>
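
To see which links the callback above would touch, it can be exercised on its own, without lxml. This is only a sketch: the function is copied from the answer, and the sample hrefs are made up for illustration.

```python
import re

def replace_link(link):
    # Append index.html only to links of the form ../<one-directory>/
    if re.match(r"\.\./[^/]+/$", link):
        link += "index.html"
    return link

# Hypothetical sample hrefs, not taken from the original page
samples = [
    "../paternalism/",            # directory link: gets index.html appended
    "../paternalism/index.html",  # already rewritten: left alone
    "../a/b/",                    # nested directory: left alone
    "http://example.com/",        # absolute URL: left alone
]
results = [replace_link(s) for s in samples]
for before, after in zip(samples, results):
    print(before, "->", after)
```

Because the pattern is anchored at the trailing slash, calling the function on an already rewritten link changes nothing, so running it twice should not stack up extra 'index.html' suffixes.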
jfs
  • Thank you, this works perfectly, except the full output is filled with special characters (“ etc). Is there something I need to do before or after I call html.rewrite? – cyrus Jan 27 '11 at 14:55
  • @cyrus: pass `your_html_text` as Unicode (use `.decode()`). Encode returned value of `rewrite_links()` using an encoding that your console understands e.g., `s.encode(sys.stdout.encoding or locale.getpreferredencoding())`. – jfs Jan 27 '11 at 15:30
  • @cyrus: if you don't know input encoding you could use the recipe from http://stackoverflow.com/questions/2686709/encoding-in-python-with-lxml-complex-solution/2688617#2688617 and then call `doc.rewrite_links(replace_links)` – jfs Jan 27 '11 at 16:31
  • Thank you for the link, but I can't seem to encode it using any method. For example, I call chardet.detect(content)['encoding'] before and after I .encode('utf-8') the HTML and it still says 'ascii'. Any ideas? – cyrus Jan 27 '11 at 21:06
  • @cyrus: Ask a new question that describes: where do you get the html from (file, web-site)? Where do you pass it (file, screen, network)? Provide an example of the failing input/output, the minimal code that reproduces the error, the error/traceback itself. – jfs Jan 28 '11 at 05:23

I think I found the problem:

reg = re.compile(r'<a href="../(.*?)">')
for match in re.findall(reg, input_html):
    output_html = input_html.replace(match, match+'index.html')

Here 'input_html' is modified inside the for loop, and then the same 'input_html' is searched again for the regex, which is the bug :)
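
A small reproduction may help here. It assumes a made-up input in which the same directory is linked twice, and it carries the result forward between iterations (which is what the output in the question suggests happened). `re.findall` returns the captured group once per occurrence, while `str.replace` rewrites every occurrence on each pass, so the suffix piles up:

```python
import re

# Hypothetical input: the same directory is linked twice
input_html = '<a href="../respect/">one</a> | <a href="../respect/">two</a>'
reg = re.compile(r'<a href="\.\./(.*?)">')

matches = reg.findall(input_html)  # one entry per occurrence

# Replacing once per match rewrites every occurrence on each pass
doubled = input_html
for match in matches:
    doubled = doubled.replace(match, match + 'index.html')

# Replacing once per distinct match avoids the repetition
fixed = input_html
for match in set(matches):
    fixed = fixed.replace(match, match + 'index.html')

print(doubled)
print(fixed)
```

Deduplicating with `set` is only a patch: it is still a plain substring replace, so it could also touch the same text outside an href.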

Arovit

Have you tried escaping your first two dots?

reg = re.compile(r'<a[ ]href="[.][.]/(.*?)">')

But I would try to use lxml instead.

Charles Beattie

The problem is that the content of the a-tag also matches what you are trying to replace.

It's in no way the ideal way to do it, but I think you will find it works correctly if you replace your regex with:

reg = re.compile(r'<a href="(\.\./.*?)">')
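
If regex is used anyway, one way to avoid the find-then-replace loop altogether is a single `re.sub` pass that only rewrites text inside the href. The following is a rough sketch under the assumption that every target link looks like `<a href="../directory/">`; the names `LINK` and `add_index` are made up for this example.

```python
import re

# Group 1 is everything up to the closing quote of a ../directory/ href
LINK = re.compile(r'(<a href="\.\./[^"/]+/)"')

def add_index(html_text):
    # Re-emit group 1, then index.html, then the closing quote
    return LINK.sub(r'\1index.html"', html_text)

# Hypothetical input with a repeated link
sample = '<a href="../respect/">respect</a> | <a href="../respect/">again</a>'
once = add_index(sample)
twice = add_index(once)
print(once)
```

Since the pattern requires the closing quote directly after the directory's trailing slash, links that already end in index.html no longer match, and a second pass should leave the text unchanged.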
torkildr

There is an error in your regex: the .. does not match only two literal dots. Each . is a metacharacter that matches any character. To match a literal dot, you need to escape it.

Your regex should be: <a href="\.\./(.*?)"

Besides, assuming all your hrefs are of the form ../somedirectory/ you can get away with a simpler regex:

for match in re.compile(r'<a href="(.*?)"').findall(html):
    html = html.replace(match, match + "index.html")

Here, the regex matches

<a href="    # start of the tag and attribute
(            # start of a group
 .*?         # any character, as few times as possible (non-greedy)
)            # end of group
"            # end of the attribute
Rodrigue
  • Thanks, Rodrigue. This still produces the same output, however. – cyrus Jan 27 '11 at 13:27
  • It would also be a tad unlucky if tags happened to be on the same line I think – torkildr Jan 27 '11 at 13:30
  • @cyrus I have updated my answer to give more explanation. I have also noticed that I had forgotten to reassign the output of `html.replace` in the loop. My example works now – Rodrigue Jan 27 '11 at 13:33
  • Also, Rodrigue, the ? after an amount will ask the regex to be non-greedy, that is, match the smallest possible, not the biggest possible group – torkildr Jan 27 '11 at 13:34
  • @torkildr do you not want to be greedy here though? Are double-quotes not allowed in URLs? You want to make sure the matching stops at the end of the href attribute, not in the middle of it. Am I correct? – Rodrigue Jan 27 '11 at 13:45
  • I don't think so, no. You would probably want to HTML-entify them (&quot;-style) either way. The reason greedy might be a bad idea is that, if you have several tags on the same line, it would match everything up to the last "> – torkildr Jan 27 '11 at 13:46
  • @torkildr good point. Thanks for the enlightenment. I have updated my answer to leave the non-greedy version – Rodrigue Jan 27 '11 at 13:54
  • Thanks for honing the regex - it works well on the sample source. However, it wouldn't work on the live HTML - I had to run splitlines() first. – cyrus Jan 27 '11 at 13:58
  • Use at least `"[^"]+"` instead of `"(.*?)"`. – jfs Jan 27 '11 at 14:34
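
The commented breakdown in the answer above can be written as a real pattern with `re.VERBOSE`, here using the `[^"]+` form from the last comment. This is just a sketch: the name `HREF` and the sample string are invented for the example, and the space in the tag has to be escaped because verbose mode ignores unescaped whitespace.

```python
import re

HREF = re.compile(r'''
    <a\ href="    # start of the tag and attribute (the space is escaped)
    (             # start of a group
      [^"]+       # one or more characters other than a double quote
    )             # end of group
    "             # end of the attribute
''', re.VERBOSE)

# Hypothetical input with two tags on the same line
sample = '<a href="../respect/">respect</a> | <a href="../well-being/">well-being</a>'
found = HREF.findall(sample)
print(found)
```

Because `[^"]+` cannot cross a double quote, each match stops at the end of its own attribute even when several tags share a line.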