
I have a question about extracting the title from a line of HTML.

Let's say my line is:

<span class="title_name"> <a href="/?id=2124">Fairwood</a></span>

(I had to add some extra spaces so that the line would not render as a hyperlink.)

How would I go about automatically extracting "Fairwood", given a number of lines that are formatted similarly but have different ids and titles?

Thanks in advance

piokuc
  • Why the downvotes? A small comment could be more helpful. – Ébe Isaac Jun 09 '17 at 08:31
  • Search for the string `href`, then start capturing just after the next `>` and stop when you reach a `<` – Haris Jun 09 '17 at 08:33
  • You might want to look at this SO post: https://stackoverflow.com/questions/11709079/parsing-html-using-python and also please do not ever ever use regex to parse HTML. See https://stackoverflow.com/a/1732454/190823 – Jens Jun 09 '17 at 08:35
  • Perhaps BeautifulSoup or a similar library could be of help: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring Regex might work in simple cases, but it could be risky: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Elisha Jun 09 '17 at 08:36
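
The plain string search suggested in the comments (find `href`, then capture between the next `>` and the following `<`) could be sketched roughly as below. The helper name `extract_title` is hypothetical, and the sketch assumes each line contains exactly one link and that no attribute value contains a `>` character.

```python
def extract_title(line):
    """Return the text between the <a href=...> tag and the next '<', or None."""
    href_pos = line.find("href")
    if href_pos == -1:
        return None  # no link on this line
    start = line.find(">", href_pos)  # end of the opening <a ...> tag
    if start == -1:
        return None
    end = line.find("<", start + 1)  # start of the closing </a> tag
    if end == -1:
        return None
    return line[start + 1:end].strip()


line = '<span class="title_name"> <a href="/?id=2124">Fairwood</a></span>'
print(extract_title(line))  # Fairwood
```

This is only reasonable for input as regular as the sample line; for anything less uniform, a parser is the safer choice.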

2 Answers


What is wrong with a parser solution?

import xml.etree.ElementTree as ET
root = ET.fromstring('<span class="title_name"> <a href="/?id=2124">Fairwood</a></span>')
print(root.find("a").text)
# Fairwood
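
To apply the same idea to many lines, one possible sketch is to parse each line separately and skip any that fail to parse. The function name `titles_from_lines` and the second sample line ("Oak Hill", id 2125) are made up for illustration; the sketch assumes every line is a well-formed XML fragment like the one in the question.

```python
import xml.etree.ElementTree as ET

# Sample input; the second line is invented for illustration.
lines = [
    '<span class="title_name"> <a href="/?id=2124">Fairwood</a></span>',
    '<span class="title_name"> <a href="/?id=2125">Oak Hill</a></span>',
]


def titles_from_lines(lines):
    """Collect the link text of each line, skipping lines that do not parse."""
    titles = []
    for line in lines:
        try:
            root = ET.fromstring(line)
        except ET.ParseError:
            continue  # not well-formed XML, so ignore this line
        link = root.find("a")
        if link is not None and link.text:
            titles.append(link.text.strip())
    return titles


print(titles_from_lines(lines))  # ['Fairwood', 'Oak Hill']
```

Note that `xml.etree.ElementTree` is an XML parser, so real-world HTML that is not well-formed (unclosed tags, bare `&` characters) would be skipped here rather than parsed.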
Jan

If the lines are all formatted similarly, you can try a regex:

import re 
html='''
<span class="title_name1"> <a href="/?id=2124">Fairwood1</a></span>
<span class="title_name2"> <a href="/?id=2125">Fairwood2</a></span>'''
# \w+ only matches a single word directly before </a></span>
print(re.findall(r'\w+(?=</a></span>)', html))
# ['Fairwood1', 'Fairwood2']
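
Because `\w+` stops at spaces and punctuation, titles of more than one word would be cut short. A possible variant, assuming the same rigid line format, captures everything between the opening and closing `a` tags instead. The second sample title ("Oak Hill") is made up for illustration.

```python
import re

# Sample input; the second line is invented for illustration.
html = '''
<span class="title_name"> <a href="/?id=2124">Fairwood</a></span>
<span class="title_name"> <a href="/?id=2125">Oak Hill</a></span>'''

# Capture any run of non-'<' characters between <a href="..."> and </a>.
pattern = re.compile(r'<a\s+href="[^"]*">([^<]*)</a>')
titles = [t.strip() for t in pattern.findall(html)]
print(titles)  # ['Fairwood', 'Oak Hill']
```

As the comments on the question point out, this remains fragile: any change in attribute order or nested markup inside the link would break the pattern.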
Kerwin