Extracting the title of my html line

Question

I have a question on extracting the title of an html line.

Let's say my line is:

<span class="title_name"> <a href="/?id=2124">Fairwood</a></span>

(I had to add some extra spaces so the line would not show up as a hyperlink.)

How would I go about automatically extracting "Fairwood", given a number of lines that are formatted similarly but have different ids and titles?

Thanks in advance

Searching for the string `href`, and then start capturing just after you encounter a `>` until you find a `<` — Haris, Jun 09 '17 at 08:33
You might want to look at this SO post: https://stackoverflow.com/questions/11709079/parsing-html-using-python and also please do not ever ever use regex to parse HTML. See https://stackoverflow.com/a/1732454/190823 — Jens, Jun 09 '17 at 08:35
Perhaps the BeautifulSoup framework or alike could be of help: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring Regex might work in simple cases, but it could be risky: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Elisha, Jun 09 '17 at 08:36

score 0 · Answer 1 · answered Jun 09 '17 at 09:22


What is wrong with a parser solution?

import xml.etree.ElementTree as ET
root = ET.fromstring('<span class="title_name"> <a href="/?id=2124">Fairwood</a></span>')
print(root.find("a").text)
# Fairwood
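
Since the question mentions a number of similarly formatted lines, here is a minimal sketch of how the same idea could be applied line by line. The sample lines and the helper name `extract_title` are assumptions made for illustration, and this only works while each line is well-formed XML; for real-world HTML a dedicated HTML parser such as BeautifulSoup (mentioned in the comments) is probably the safer choice.

```python
import xml.etree.ElementTree as ET

# Sample input: these lines are assumptions modelled on the question's format.
lines = [
    '<span class="title_name"> <a href="/?id=2124">Fairwood</a></span>',
    '<span class="title_name"> <a href="/?id=2125">Oak Hill</a></span>',
]

def extract_title(line):
    """Return the text of the first <a> directly inside the line's root element, or None."""
    try:
        root = ET.fromstring(line)
    except ET.ParseError:
        return None  # the line is not well-formed XML
    link = root.find("a")
    return link.text if link is not None else None

titles = [extract_title(line) for line in lines]
print(titles)
# ['Fairwood', 'Oak Hill']
```

Returning `None` for lines that fail to parse or have no `<a>` child keeps one bad line from aborting the whole run.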

answered Jun 09 '17 at 09:22

Jan


score 0 · Answer 2 · answered Jun 09 '17 at 09:25


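If the lines are all formatted similarly, you can try a regex:

import re

html = '''
<span class="title_name1"> <a href="/?id=2124">Fairwood1</a></span>
<span class="title_name2"> <a href="/?id=2125">Fairwood2</a></span>'''
# Match the run of word characters immediately before each closing </a></span>
print(re.findall(r'\w+(?=</a></span>)', html, re.M))
# ['Fairwood1', 'Fairwood2']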
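
Note that `\w+` only matches a single run of word characters, so a title containing a space would be cut short. A possible variant, sketched here with made-up sample titles, captures everything between the opening `<a ...>` and the closing `</a></span>` instead. It still assumes the lines keep exactly this shape and is not a general HTML parser.

```python
import re

# Sample input: the second title is an assumption used to show a title with a space.
html = '''
<span class="title_name"> <a href="/?id=2124">Fairwood</a></span>
<span class="title_name"> <a href="/?id=2125">Oak Hill</a></span>'''

# Group 1 captures the link text: any characters up to the next '<'.
pattern = re.compile(r'<a\s[^>]*>([^<]*)</a>\s*</span>')
titles = pattern.findall(html)
print(titles)
# ['Fairwood', 'Oak Hill']
```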

answered Jun 09 '17 at 09:25

Kerwin


You don't need the multiline flag if there are no anchors to be matched. – Jan Jun 09 '17 at 11:49
