Python and re.compile return inconsistent results

Question

I'm trying to replace all instances of href="../directory" with href="../directory/index.html".

In Python, this

reg = re.compile(r'<a href="../(.*?)">')
for match in re.findall(reg, input_html):
    output_html = input_html.replace(match, match+'index.html')

produces the following output:

href="../personal-autonomy/index.htmlindex.htmlindex.htmlindex.html"  
href="../paternalism/index.html"  
href="../principle-beneficence/index.htmlindex.htmlindex.html"  
href="../decision-capacity/index.htmlindex.htmlindex.html"

Any idea why it works with the second link, but the others don't?

Relevant part of the source:

<p> 

 <a href="../personal-autonomy/">autonomy: personal</a> |
 <a href="../principle-beneficence/">beneficence, principle of</a> |
 <a href="../decision-capacity/">decision-making capacity</a> |
 <a href="../legal-obligation/">legal obligation and authority</a> |
 <a href="../paternalism/">paternalism</a> |
 <a href="../identity-personal/">personal identity</a> |
 <a href="../identity-ethics/">personal identity: and ethics</a> |
 <a href="../respect/">respect</a> |
 <a href="../well-being/">well-being</a> 

</p>

EDIT: The repeated 'index.html' is actually the result of multiple matches. (e.g. href="../personal-autonomy/index.htmlindex.htmlindex.htmlindex.html" is because ../personal-autonomy is found four times in the original source).

As a general regex question, how would you replace all instances without adding an additional 'index.html' to all matches?

Why are you trying to parse HTML with regex? There's plenty of powerful parsers that could easily extract these statements by reading the DOM. Regex was not designed for HTML. — wheaties, Jan 27 '11 at 13:27
A solution, of sorts: running .splitlines() on the source HTML, and then running the regex on each line, produced the desired result. However, I'm still not sure why it didn't work without splitting. — cyrus, Jan 27 '11 at 13:51

score 5 · Accepted Answer · edited May 23 '17 at 12:13

Don't parse HTML with regexes:

import re    
from lxml import html

def replace_link(link):
    if re.match(r"\.\./[^/]+/$", link):
        link += "index.html"
    return link

print(html.rewrite_links(your_html_text, replace_link))

Output

<p> 

 <a href="../personal-autonomy/index.html">autonomy: personal</a> |
 <a href="../principle-beneficence/index.html">beneficence, principle of</a> |
 <a href="../decision-capacity/index.html">decision-making capacity</a> |
 <a href="../legal-obligation/index.html">legal obligation and authority</a> |
 <a href="../paternalism/index.html">paternalism</a> |
 <a href="../identity-personal/index.html">personal identity</a> |
 <a href="../identity-ethics/index.html">personal identity: and ethics</a> |
 <a href="../respect/index.html">respect</a> |
 <a href="../well-being/index.html">well-being</a> 

</p>
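To see which links the callback touches without installing lxml, here is a minimal sketch that calls the same replace_link function directly. The sample links are made up for illustration; only the first has the ../directory/ shape that the pattern accepts.

```python
import re

def replace_link(link):
    # Same callback as in the answer: only links of the form ../name/ are extended.
    if re.match(r"\.\./[^/]+/$", link):
        link += "index.html"
    return link

# Hypothetical sample links, not taken from the asker's page.
samples = ["../paternalism/", "../paternalism/index.html", "http://example.com/"]
for link in samples:
    print(link, "->", replace_link(link))
```

Because the pattern is anchored at the end with $, a link that already ends in index.html is left alone, so running the rewrite twice should not stack suffixes.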

answered Jan 27 '11 at 14:26

jfs

Thank you, this works perfectly, except the full output is filled with special characters (â€œ etc). Is there something I need to do before or after I call html.rewrite? – cyrus Jan 27 '11 at 14:55
@cyrus: pass `your_html_text` as Unicode (use `.decode()`). Encode returned value of `rewrite_links()` using an encoding that your console understands e.g., `s.encode(sys.stdout.encoding or locale.getpreferredencoding())`. – jfs Jan 27 '11 at 15:30
@cyrus: if you don't know the input encoding you could use the recipe from http://stackoverflow.com/questions/2686709/encoding-in-python-with-lxml-complex-solution/2688617#2688617 and then call `doc.rewrite_links(replace_link)` – jfs Jan 27 '11 at 16:31
Thank you for the link, but I can't seem to encode it using any method. For example, I call chardet.detect(content)['encoding'] before and after I .encode('utf-8') the HTML and it still says 'ascii'. Any ideas? – cyrus Jan 27 '11 at 21:06
@cyrus: Ask a new question that describes: where do you get the html from (file, web-site)? Where do you pass it (file, screen, network)? Provide an example of the failing input/output, the minimal code that reproduces the error, the error/traceback itself. – jfs Jan 28 '11 at 05:23
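
One common cause of output like â€œ is UTF-8 bytes being decoded as Windows-1252. That is an assumption here, since the thread never shows the input encoding. A small Python 3 sketch of the effect, using a made-up sample string:

```python
# Hypothetical sample: text starting with a left curly quote, stored as UTF-8 bytes.
raw = "\u201cquoted".encode("utf-8")

# Decoding with the wrong codec turns the three UTF-8 bytes of the quote
# into three separate characters; garbled is now 'â€œquoted'.
garbled = raw.decode("cp1252")

# Decoding with the codec the bytes were written in restores the text,
# which can then be passed to lxml as Unicode.
text = raw.decode("utf-8")
```

If this is what is happening, decoding the input explicitly before calling rewrite_links, as suggested in the comments above, should make the stray characters disappear.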

Arovit · Answer 2 · 2011-01-27T14:21:00.443

1

I think I found the problem:

reg = re.compile(r'<a href="../(.*?)">')
for match in re.findall(reg, input_html):
    output_html = input_html.replace(match, match+'index.html')

Here 'input_html' is modified inside the for loop, and then the same 'input_html' is searched again for the regex, which is the bug :)
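
A minimal sketch of the effect described in the question's EDIT, and of one way around it: str.replace rewrites every occurrence of the matched text, so a directory that is linked twice gets the suffix twice, whereas a single re.sub pass touches each link exactly once. The sample HTML and the variable names are made up for illustration.

```python
import re

# Hypothetical sample: the same directory is linked twice.
html = (
    '<a href="../personal-autonomy/">autonomy</a> |\n'
    '<a href="../paternalism/">paternalism</a> |\n'
    '<a href="../personal-autonomy/">autonomy again</a>\n'
)

# Loop-and-replace: each match rewrites ALL occurrences of its text,
# so the duplicated link ends up with two suffixes.
buggy = html
for match in re.findall(r'<a href="\.\./(.*?)">', buggy):
    buggy = buggy.replace(match, match + 'index.html')

# Single pass: re.sub visits each link once and inserts the suffix in place.
fixed = re.sub(r'(<a href="\.\./[^"]*/)(">)', r'\1index.html\2', html)

print(buggy)
print(fixed)
```

Deduplicating the matches first (for example with set()) would also avoid the stacking, but the single substitution keeps the change tied to the href attribute rather than to any text that happens to look the same.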

edited Jan 27 '11 at 14:21

answered Jan 27 '11 at 14:09

Arovit

Have a different var for storing the result – Arovit Jan 27 '11 at 14:31

Charles Beattie · Answer 3 · 2011-01-27T12:52:20.800

0

Have you tried escaping your first two dots?

reg = re.compile(r'<a[ ]href="[.][.]/(.*?)">')

But I would try to use lxml instead.

edited Jan 27 '11 at 12:52

answered Jan 27 '11 at 12:44

Charles Beattie

why would it matter in this case? – SilentGhost Jan 27 '11 at 13:24

score 0 · Answer 4 · answered Jan 27 '11 at 13:11

The problem is the content of the a-tag also matches what you try to replace.

It's in no way the ideal way to do it, but I think you will find it works correctly if you replace your regex with:

reg = re.compile(r'<a href="(\.\./.*?)">')

torkildr

Rodrigue · Answer 5 · 2011-01-27T14:06:54.263

0

There is an error in your regex: the .. is not restricted to two literal dots. An unescaped . is a metacharacter that matches any character. To match a literal dot, you need to escape it.

Your regex should be: <a href="\.\./(.*?)"

Besides, assuming all your href are of the form ../somedirectory/ you can get away with a simpler regex:

for match in re.compile(r'<a href="(.*?)"').findall(html):
    html = html.replace(match, match + "index.html")

Here, the regex matches

<a href="    # start of the tag and attribute
(            # start of a group
 .*?         # any character, as few times as possible
)            # end of group
"            # end of the attribute

edited Jan 27 '11 at 14:06

answered Jan 27 '11 at 13:15

Rodrigue

Thanks, Rodrigue. This still produces the same output, however. – cyrus Jan 27 '11 at 13:27
It would also be a tad unlucky if tags happened to be on the same line I think – torkildr Jan 27 '11 at 13:30
@cyrus I have updated my answer to give more explanation. I have also noticed that I had forgotten to reassign the output of `html.replace` in the loop. My example works now – Rodrigue Jan 27 '11 at 13:33
Also, Rodrigue, the ? after an amount will ask the regex to be non-greedy, that is, match the smallest possible, not the biggest possible group – torkildr Jan 27 '11 at 13:34
@torkildr do you not want to be greedy here though? Are double-quotes not allowed in URLs? You want to make sure the matching stops at the end of the href attribute, not in the middle of it. Am I correct? – Rodrigue Jan 27 '11 at 13:45
I don't think so, no. You would probably want to html-entify them `&quot;`-style either way. The reason greedy might be a bad idea is that, if you have several tags on the same line, the match would run on past the first closing quote to a later one – torkildr Jan 27 '11 at 13:46
@torkildr good point. Thanks for the enlightenment. I have updated my answer to leave the non-greedy version – Rodrigue Jan 27 '11 at 13:54
Thanks for honing the regex - it works well on the sample source. However, it wouldn't work on the live HTML - I had to run splitlines() first. – cyrus Jan 27 '11 at 13:58
Use at least `"[^"]+"` instead of `"(.*?)"`. – jfs Jan 27 '11 at 14:34
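
A short sketch of why the negated character class in the comment above is the safer choice. With the question's pattern, the non-greedy group still has to reach the literal `">`, so an extra attribute gets swallowed; `[^"]+` cannot cross a quote. The tag with a class attribute is a made-up example, not taken from the asker's page.

```python
import re

# Hypothetical tag with an extra attribute after href.
tag = '<a href="../respect/" class="nav">respect</a>'

# Pattern from the question (dots escaped): the group runs on until it finds ">.
loose = re.compile(r'<a href="\.\./(.*?)">')
print(loose.findall(tag))

# Negated character class: the group stops at the first closing quote.
strict = re.compile(r'<a href="(\.\./[^"]+)"')
print(strict.findall(tag))
```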
