
I have an HTML file I got from Wikipedia and would like to find every link on the page, such as /wiki/Absinthe, and replace it with the current directory added to the front, such as /home/fergus/wikiget/wiki/Absinthe, so that:

<a href="/wiki/Absinthe">Absinthe</a>

becomes:

<a href="/home/fergus/wikiget/wiki/Absinthe">Absinthe</a>

and this is throughout the whole document.

Do you have any ideas? I'm happy to use BeautifulSoup or Regex!

Fergus Barker
  • If you are working in Linux, then there is quite a simple solution to find and replace text in a document. If I've understood you right, please do reply. – Prateek Mar 07 '11 at 09:29

5 Answers


If that's really all you have to do, you could do it with sed and its -i option to rewrite the file in-place:

sed -i -e 's,href="/wiki,href="/home/fergus/wikiget/wiki,g' wiki-file.html
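If you want a safety net, sed's -i option can take a backup suffix so the original file is preserved. A minimal sketch (the sample file name is illustrative; the -i.bak form works with both GNU and BSD sed):

```shell
# create a small sample file, then rewrite its /wiki links in place,
# keeping the original content as wiki-file.html.bak
printf '<a href="/wiki/Absinthe">Absinthe</a>\n' > wiki-file.html
sed -i.bak 's,href="/wiki,href="/home/fergus/wikiget/wiki,g' wiki-file.html
cat wiki-file.html
```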

However, here's a Python solution using the lovely lxml API, in case you need to do anything more complex or you might have badly formed HTML, etc.:

from lxml import etree
import re

parser = etree.HTMLParser()

with open("wiki-file.html") as fp:
    tree = etree.parse(fp, parser)

for e in tree.xpath("//a[@href]"):
    link = e.attrib['href']
    if re.search('^/wiki',link):
        e.attrib['href'] = '/home/fergus/wikiget'+link

# Or you can just specify the same filename to overwrite it:
with open("wiki-file-rewritten.html", "wb") as fp:
    fp.write(etree.tostring(tree))

Note that lxml is probably a better option than BeautifulSoup for this kind of task nowadays, for the reasons given by BeautifulSoup's author.

Mark Longair
  • +1: for using a real parser. `lxml.html.rewrite_links()` is a simpler alternative http://stackoverflow.com/questions/5217760/find-and-append-each-reference-to-a-html-link-python/5218837#5218837 – jfs Mar 07 '11 at 12:58
  • -1 for using an over-powerful tool that isn't necessary (in fact , I don't downvote, it's useless) – eyquem Mar 07 '11 at 13:56
  • @J.F. Sebastian: thanks for pointing that out - since you've added an answer using `rewrite_links()` I'll leave mine as is. – Mark Longair Mar 08 '11 at 21:06
  • @eyquem: Thanks for not actually downvoting, anyway. I did say in my answer "in case you need to do anything more complex or you might have badly formed HTML, etc.", which in my experience very often turns out to be the case. I take your point though. – Mark Longair Mar 08 '11 at 21:12
  • I hadn't payed enough attention to the condition under which you placed the code with lxml. With this modulation , I agree with you: to do simple tasks, use of simple tools as sed; and for harder ones, use of parsers is better. It's a real problem if XML/HTML are often badly formed: it prevents to use regexes with confidence, while they are perfectly powerful to catch very complicated patterns. **PS**: I wasn't really serious speaking of downvote: I don't think I am enough skilled in Python to put downvotes on far better coders than me. It was a manner to stress my opinion – eyquem Mar 08 '11 at 21:44

You can use a function with re.sub:

import re

def match(m):
    return '<a href="/home/fergus/wikiget' + m.group(1) + '">'

r = re.compile(r'<a\shref="([^"]+)">')
r.sub(match, yourtext)

An example:

>>> s = '<a href="/wiki/Absinthe">Absinthe</a>'
>>> r.sub(match, s)
'<a href="/home/fergus/wikiget/wiki/Absinthe">Absinthe</a>'
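Note that the pattern above rewrites every href, not just the Wikipedia-style ones; if only /wiki links should get the prefix, one way is to anchor the capture group on /wiki. A sketch along the same lines (the sample string is illustrative):

```python
import re

def match(m):
    return '<a href="/home/fergus/wikiget' + m.group(1) + '">'

# only capture hrefs that start with /wiki; other links are left alone
r = re.compile(r'<a\shref="(/wiki/[^"]+)">')

s = '<a href="/wiki/Absinthe">Absinthe</a> and <a href="/static/x.css">x</a>'
print(r.sub(match, s))
```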
rubik

This is a solution using the re module:

#!/usr/bin/env python
import re

open('output.html', 'w').write(re.sub('href="/wiki', 'href="/home/fergus/wikiget/wiki', open('file.html').read()))

Here's another one without using re:

#!/usr/bin/env python
open('output.html', 'w').write(open('file.html').read().replace('href="/wiki', 'href="/home/fergus/wikiget/wiki'))
Paweł Nadolski
  • @Fergus Barker, Pawel . Incoherent code: if you do an iteration on lines ``for line in ...`` it is because the file is so big that a treatment by line is obligatory. But **readlines()** treats the entire file in one time. So it must be ``for line in open('file.html')`` , or ``content = open('file.html').read()`` then ``out.write(re.sub('href="/wiki/Absinthe', 'href="/home/fergus/wikiget/wiki/Absinthe',content) `` but not a mix of the two . Moreover, to do what you do, **replace()** is enough ! ``out.write(content.replace('href="/wiki/Absinthe', 'href="/home/fergus/wikiget/wiki/Absinthe')) `` – eyquem Mar 07 '11 at 10:40
  • @eyquem You're right, incoherent and not efficient but simple and it works. Updated my comment to fix some of the issues you reported. – Paweł Nadolski Mar 07 '11 at 16:21

I would do

import re

ch = '<a href="/wiki/Absinthe">Absinthe</a>'

r = re.compile(r'(<a\s+href=")(/wiki/[^"]+">[^<]+</a>)')

print(ch)
print()
print(r.sub(r'\1/home/fergus/wikiget\2', ch))

EDIT:

this solution has been said not to capture tags with additional attributes. I thought the aim was a narrow pattern of string, such as <a href="/wiki/WORD">WORD</a>

If not, well, no problem: a solution with a simpler RE is easy to write.

r = re.compile(r'(<a\s+href="/)([^>]+">)')

ch = '<a href="/wiki/Aide:Homonymie" title="Aide:Homonymie">'
print(ch)
print(r.sub(r'\1home/fergus/wikiget/\2', ch))

or why not:

r = re.compile(r'(<a\s+href="/)')

ch = '<a href="/wiki/Aide:Homonymie" title="Aide:Homonymie">'
print(ch)
print(r.sub(r'\1home/fergus/wikiget/', ch))
eyquem
from lxml import html

el = html.fromstring('<a href="/wiki/word">word</a>')
# or `el = html.parse(file_or_url).getroot()`

def repl(link):
    if link.startswith('/'):
        link = '/home/fergus/wikiget' + link
    return link

print(html.tostring(el).decode())
el.rewrite_links(repl)
print(html.tostring(el).decode())

Output

<a href="/wiki/word">word</a>
<a href="/home/fergus/wikiget/wiki/word">word</a>

You could also use the function lxml.html.rewrite_links() directly:

from lxml import html

def repl(link):
    if link.startswith('/'):
        link = '/home/fergus/wikiget' + link
    return link

print(html.rewrite_links(htmlstr, repl))
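A self-contained run of the same idea, with a sample string standing in for htmlstr:

```python
from lxml import html

def repl(link):
    # prefix site-relative links; absolute URLs are left untouched
    if link.startswith('/'):
        link = '/home/fergus/wikiget' + link
    return link

htmlstr = '<a href="/wiki/Absinthe">Absinthe</a>'
print(html.rewrite_links(htmlstr, repl))
```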
jfs
  • @J.F. Sebastian What does it becomes if the word isn't 'Absinthe' ? – eyquem Mar 07 '11 at 11:28
  • @eyquem: I've replaced the word to avoid confusion. – jfs Mar 07 '11 at 12:53
  • @J.F. Sebastian It doesn't solve the problem, your solution with any 'word' can't be generic. Take (http://en.wikipedia.org/wiki/Marcel_Deiss) as an example page. There are ``French`` and ``wine grower`` and ``wine grower`` and ``Alsace wine region`` in the same sentence. How will you make 'word' be **France** then **Winemaking** then **Bergheim,_Haut-Rhin** then **Alsace_wine** ? – eyquem Mar 07 '11 at 13:43
  • @J.F. Sebastian Moreover, there are links like that: ``edit`` - or - ``edit`` – eyquem Mar 07 '11 at 13:48
  • @eyquem: 1. The comment with **France**, etc is wrong. The only condition that the code uses is that a link starts with `'/'`. 2. If you don't want to convert `"edit"` links you could use `link.startswith('/wiki')` in the `repl()` function. – jfs Mar 07 '11 at 14:10
  • @J.F. Sebastian 1) Maybe you are right, maybe the word wiki isn't strict as I believed (why I believed that ?) . But the OP should not present a deceitful example. Besides, I read _el = html.fromstring('<a href="/wiki/word">word</a>')_ in your code 2) Yes. 3) I don't see what superiority have the **fromstring()** and **rewrite_links()** of **lxml** upon **sub()** of **re** module – eyquem Mar 07 '11 at 14:27
  • @eyquem: here's some examples on "why `lxml` and not `re`": http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-rege – jfs Mar 07 '11 at 14:45
  • @J.F. Sebastian Thank you for the link, that's exactly what I wished to find in order to study precisely the reasons of inadequacy of regexes to **parse** HTML and XML. Because I **did** understand that regexes **cannot** parse HTML or XML. But 1) what is the definition of 'parsing' ? 2) can you give me a practical case in which my regex solution fails to find a valid string that should be catched ? – eyquem Mar 07 '11 at 17:14
  • @eyquem: 1. in this case 'parsing' is a process that converts a sequence of bytes into a meaningful representation such as a process of extracting links from a given html blob 2. your regex easily breaks (even this page contains type of links that your regex wouldn't catch) I leave you as an exercise to find these cases. For html & regex "practicality" means that you don't care that your regex fails (it is good enough). – jfs Mar 07 '11 at 18:34
  • @J.F. Sebastian What does mean 'html blob 2' ? – eyquem Mar 07 '11 at 18:39
  • @eyquem: read it here as "sequence of bytes that contains html markup if we convert it to text using appropriate character encoding **.** 2. ... second item in the list starts here ..." – jfs Mar 07 '11 at 19:37
  • @J.F. Sebastian Oh, I was scatterbrained not to understand that 2 indicates the second part of your comment...... Concerning the instruction with 'Absinthe' or 'word' : I now understand this instruction, ``el = html.fromstring(....)`` transforms an XML text in an lxml representation of the text ; it is not an analogue instruction to a ``re.compile(RE)`` defining what must be searched; on the contrary, it's the first step of the treatment with lxml tool, to obtain the type of data that is manipulated by the lxml functions. The fact that the chosen string is very short deceived me. – eyquem Mar 07 '11 at 23:35
  • @J.F. Sebastian I'm not at my ease when I read the saying _"no regex for parsing HTML and XML"_ presented as an automatic reason to not use regexes to even search in a HTML or XML formatted text. I say: search, not parse. That's why I asked you what you mean by 'parsing', because my solution to the question of Fergus Barker, and more generally the answers to people who search short and simple strings , don't pretend to parse a HTML/XML text. They just search for strings with the aid of a not so much sophisticated tool than parsers: regexes. – eyquem Mar 08 '11 at 01:35
  • @J.F. Sebastian They are less powerfull and are inadequate for parsing, but then ? , what is the problem if a squad of regex functions is sufficient to find what is desired without mobilizing an entire army whose only a tiny part will be acting ? The justification is often that a parser is more reliable, and there are plenty of convoluted exemples and nasty cases that are presented as justifications to avoid regexes to search in HTML/XML text. – eyquem Mar 08 '11 at 01:36
  • @J.F. Sebastian But why would there be a ugly special unmatching case in the source code of a Wikipedia page ? That's why I asked for exemples (not weird ones, just realistic ones) that could make my regex fail. You prompt me to find them myself but I don't see what you mean. – eyquem Mar 08 '11 at 01:37
  • @J.F. Sebastian I will not continue in the long debate "for/con regexes for HTML/XML texts" that seems to exist on stackoverflow, I'm just not made to obey to rules expressed as general truth of which I don't undestand the bases and I try to understand well – eyquem Mar 08 '11 at 01:42
  • @eyquem: your regex breaks in common cases such as: a space after the quote, any additional attribute e.g., `title`, nested tags. – jfs Mar 08 '11 at 22:02
  • @J.F. Sebastian But if the person who writes the regex is the person who knows what is wanted there are no such problems as 'no match with additional attribute': the person will directly write a correct regex relatively to what he wants to catch. That's a false problem, the insuffisance of my RE doesn't invalidate the use of regexes in general. See my EDIT for a more general RE catching tag with additional attribute. – eyquem Mar 08 '11 at 23:45
  • I don't see any EDIT, but I'm sure RE in it will have problems too. Just remember: *Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.* http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 If it doesn't persuade you, nothing will. – jfs Mar 08 '11 at 23:54
  • @J.F. Sebastian After what quote ? Are there nested tag in the Wikipedia page ? I wish concrete cases. (I used the word "practical" for "concrete", but it seems that in english it isn't the same idea as the one in french expressed by "pratique") – eyquem Mar 08 '11 at 23:55
  • @J.F. Sebastian I **don't** try to **parse**, I already wrote that. I only catch a limited length of string. That's why I asked for the definition of "parsing", because there is an ambiguity that confuses the debate. In Wikipedia: parsing _"is the process of analyzing a text, made of a sequence of tokens, to determine its grammatical structure with respect to a given (more or less) formal grammar. "_ Well, parsing is establishing a tree of tokens. A regex don't establish and verify a tree of an XML/HTML. – eyquem Mar 09 '11 at 00:03
  • @J.F. Sebastian The cited post is fantastic, excellent. But there is only one rational reason expressed: _"HTML is not a regular language and hence cannot be parsed by regular expressions"_ (by the way, it is about parsing) It is not because a text is very pleasant that it is sufficent to instantly persuade. – eyquem Mar 09 '11 at 00:08