How do I search for a link in this web site (on Linux)?

Question

I'm trying to write an XBMC plugin for mako.co.il (I know there is xbmako, but I can't install it on Linux). When I run a regex against the episodes page I don't get any results. I tried this web page, and there I could find the link using a href=".*?">\n\t*<img

Here is a test site: http://www.mako.co.il/mako-vod-keshet/aharoni_cooks

And here is the tutorial: http://wiki.xbmc.org/index.php?title=HOW-TO_write_plugins_for_XBMC

I think it has something to do with the line break. The solution I thought of is to search for anything that has a href=".*?"> followed by anything, followed by \t<img

Edit:
OK, so I am now trying to do this DOM/XML-parsing style. I am stuck because line 101 of the page has a (JavaScript?) part with a for loop, which the parser takes for a tag...

Don't parse HTML with a regex (http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Instead, parse the DOM. — , Aug 25 '11 at 21:58
So... I don't know where the xml file of the webpage... Looking at the added link, I think I can apply this on an xhtml file... http://www.travisglines.com/web-coding/python-xml-parser-tutorial — Yotam, Aug 26 '11 at 05:25
Not XML, [X]HTML. Use a DOM parser to parse the [X]HTML on the page. — , Aug 26 '11 at 05:30
@Jack Maney: I'm not sure I have understood you. Should I use xml logic on the (downloaded) xml file from the website? — Yotam, Aug 26 '11 at 06:05
No, not XML (unless the information you're looking for is hidden inside of XML). You'll have to pick a language that you're comfortable with and use a DOM parser written in that language. For example, a quick Google search brought up a DOM parser in PHP: http://simplehtmldom.sourceforge.net/ If you know JavaScript, there are also several libraries (Dojo and jQuery are two that come immediately to mind) that allow you to easily grab elements by type (eg grab all anchor tags). — , Aug 26 '11 at 06:15
You're probably writing your plugin in Python, since this is an XBMC question, right? You might benefit from adding the language to the tags list. — Geoff, Dec 04 '12 at 20:49

score 0 · Answer 1 · answered Dec 04 '12 at 20:59

Use a DOM parser

You should not manually parse the HTML file. Instead, try using a DOM parser. I suggest minidom or ElementTree for general Python code.
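As a rough illustration of the parser-based approach, here is a minimal sketch that collects the href of every a element containing an img. It swaps in the standard-library html.parser module (named HTMLParser on Python 2) rather than minidom or ElementTree, on the assumption that the page is not well-formed XML and contains inline JavaScript, as the question's edit describes. The class name ImageLinkCollector and the sample markup are hypothetical and not taken from the real site.

```python
from html.parser import HTMLParser


class ImageLinkCollector(HTMLParser):
    """Collect the href of every <a> element that contains an <img>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._open_href = None  # href of the <a> we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open_href = dict(attrs).get("href")
        elif tag == "img" and self._open_href is not None:
            self.links.append(self._open_href)
            self._open_href = None  # record each link only once

    def handle_endtag(self, tag):
        if tag == "a":
            self._open_href = None


# Hypothetical markup: a script with a for loop, one image link, one text link.
sample = (
    '<script>for (var i = 0; i < 3; i++) { total += i; }</script>\r\n'
    '<a href="/episode-1">\r\n\t<img src="one.jpg"></a>\r\n'
    '<a href="/about">About</a>\r\n'
)

collector = ImageLinkCollector()
collector.feed(sample)
collector.close()
image_links = collector.links
print(image_links)
```

The parser treats the contents of script elements as raw text, so the `<` inside the for loop is not mistaken for a tag, and the line-ending style between the a and img tags does not matter.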

XBMC

Since you mention XBMC, I suggest that you use the Parsedom plugin, which is designed for this purpose.

The plugin page shows you how to list all the a tags, or to select certain ones.

score 0 · Answer 2 · answered Aug 26 '11 at 02:44

The site uses CR-LF for line breaks, but your regex assumes they are LF. You could deal with this by checking for both styles:

a href=".*?">\r?\n\t*<img
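A small sketch of the difference, assuming Python's re module and a made-up snippet with CR-LF line endings (the href value and image name are hypothetical). A capture group is added around the href so the link itself is returned:

```python
import re

# Hypothetical snippet imitating the markup described in the question,
# with a CR-LF line break between the anchor and the image tag.
html = '<a href="/mako-vod-keshet/aharoni_cooks/ep1">\r\n\t\t<img src="thumb.jpg">'

# Original pattern: expects a bare LF directly after the closing ">".
lf_only = re.compile(r'a href="(.*?)">\n\t*<img')
# Adjusted pattern: allows an optional CR before the LF.
either = re.compile(r'a href="(.*?)">\r?\n\t*<img')

lf_matches = lf_only.findall(html)
crlf_matches = either.findall(html)
print(lf_matches)
print(crlf_matches)
```

With this input the first pattern finds nothing, because a `\r` sits between `">` and the `\n`, while the second pattern returns the link.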

plasticinsect

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Aug 26 '11 at 02:47