I am trying to strip all the HTML surrounding the data I want from a webpage, so that only the raw data is left and I can insert it into a database. So if I have something like:
<p class="location"> Atlanta, GA </p>
The following code would return
Atlanta, GA </p>
But that is not what I expect: I want only the text Atlanta, GA, with the closing tag removed as well. This is a more specific solution to the basic problem I found here. Any help would be appreciated, thanks! The code is below.
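import re

def delHTML(self, html):
    """
    html is a list made up of items with data surrounded by html
    this function should get rid of the html and return the data as a list
    """
    for n, item in enumerate(html):
        # re.match returns a match object (or None), so test it for truthiness
        # rather than comparing it to the item with ==
        if re.match('<p class="location">', str(item)):
            # only the opening tag is removed here; the closing </p> is left behind
            html[n] = re.sub('<p class="location">', '', str(item))
    return html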
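
For comparison, here is a minimal sketch of the behaviour described above. It assumes every item is a single `<p class="location">` element with no nested tags. The names `LOCATION_TAG` and `strip_location_tags` are hypothetical and not part of the original code. For anything less regular than this fixed pattern, a real HTML parser would probably be a safer choice than a regular expression.

```python
import re

# Hypothetical pattern (an assumption, not from the original post): matches one
# wrapping <p class="location"> ... </p> pair and captures the text between the tags.
LOCATION_TAG = re.compile(r'^\s*<p class="location">(.*?)</p>\s*$', re.DOTALL)

def strip_location_tags(items):
    """Return a new list with the surrounding <p class="location"> tags removed."""
    cleaned = []
    for item in items:
        match = LOCATION_TAG.match(str(item))
        # Items that do not match the expected pattern are kept unchanged.
        cleaned.append(match.group(1).strip() if match else item)
    return cleaned

print(strip_location_tags(['<p class="location"> Atlanta, GA </p>']))
# prints ['Atlanta, GA']
```

Unlike the function above, this builds a new list instead of modifying the input in place, and it removes the opening and closing tags in one step by capturing only the text between them.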