I've searched through a few threads, but I either get an error or don't get the expected result when I try the pandas replace method or the regex re.sub function (Python 3.x).
I'm pulling in HTML data, and because of the odd way it is tagged I can't extract the data I need. For example, each row looks like this:
<div class="song">
<p><span class="small">07/06 4:21 AM</span> - <span class="small">Title:</span> Crazy For You - <span class="small">Artist:</span> Scars On 45
<span class="small"><a href="http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Ddigital-music&field-keywords=Crazy For You+Scars On 45&tag=wt897fmrafomu-20" target="_blank">Buy Song</a> </span>
I'm using the code below to pull in the HTML, and I want to strip out most of the text so that I'm left with the date/time (e.g. 07/06 4:21 AM), the artist (e.g. Scars On 45), and the song (e.g. Crazy For You). I get either an error or an unexpected result when I run any of the last three lines.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re
import numpy as np
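html = urlopen("http://wtmd.org/radio/RecentSongs.html")
soup = BeautifulSoup(html.read(), "html.parser")  # name the parser explicitly
Songs = soup.select('div.song')  # one Tag per song row
#data=np.asarray(Songs)
df = pd.DataFrame({'col1': Songs})
df['col1'] = df['col1'].apply(str)  # convert each Tag to its HTML string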
# errors below
df['col1'] = df['col1'].replace("<div class=\"song\">", ",")  # this does not get replaced
df['col1'] = re.sub("<div class=\"song\">", ",", df['col1'])  # this throws TypeError: expected string or buffer
df['col1'] = re.sub("<(.*?)>", ",", df['col1'])  # this throws TypeError: expected string or buffer
I've tried these methods both with and without the .apply(str) call, but neither approach works.
I've tried a few different ways of escaping the quotes in the replace call (i.e. using """ and ' to delimit the search string). Any ideas or insights are greatly appreciated!
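For reference, a likely reading of the two symptoms: Series.replace with a plain string only replaces cells whose entire value equals that string, and re.sub expects a single string rather than a whole Series. Below is a minimal sketch of the extraction on one row string, using only the standard library. The names ROW, TAG, FIELDS and parse_song are hypothetical, the href in the sample is a placeholder for the long Amazon link, and the pattern assumes every row follows the "time - Title: ... - Artist: ..." layout shown above.

```python
import re
from html import unescape

# Sample row modelled on the one in the question; the href is a placeholder.
ROW = (
    '<div class="song">\n'
    '<p><span class="small">07/06 4:21 AM</span> - '
    '<span class="small">Title:</span> Crazy For You - '
    '<span class="small">Artist:</span> Scars On 45\n'
    '<span class="small"><a href="http://example.invalid/" '
    'target="_blank">Buy Song</a> </span>'
)

# Matches any single HTML tag.
TAG = re.compile(r"<[^>]*>")

# Assumed layout of the tag-free text: "<time> - Title: <song> - Artist: <artist>",
# optionally followed by the "Buy Song" link text.
FIELDS = re.compile(
    r"^(?P<played>.+?) - Title: (?P<title>.+?) - Artist: (?P<artist>.+?)"
    r"(?: Buy Song)?$"
)


def parse_song(row_html):
    """Return a dict with played/title/artist, or None if the layout differs."""
    text = TAG.sub(" ", row_html)      # drop tags, keep the text between them
    text = unescape(text)              # turn entities such as &amp; back into characters
    text = " ".join(text.split())      # collapse newlines and repeated spaces
    match = FIELDS.match(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_song(ROW))
```

With the DataFrame from the question, the same function could be applied cell by cell with df['col1'].apply(parse_song), since apply passes one string at a time. For a plain substring replacement, df['col1'].str.replace('<div class="song">', ',', regex=False) works on part of each string, unlike df['col1'].replace with the same arguments.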