How do I resolve error while removing parts of a string in Python DataFrame?

Question

I've searched through a few threads and either I'm getting an error or I'm not getting the expected result when trying either the pandas replace method or the regex re.sub method(python 3.x).

I'm pulling in html data and due to the odd tagging nature I can't extract the data I need. For example each row looks like below

<div class="song">
<p><span class="small">07/06 4:21 AM</span>     -     <span class="small">Title:</span>   Crazy For You   -     <span class="small">Artist:</span>  Scars On 45    
 <span class="small"><a href="http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Ddigital-music&amp;field-keywords=Crazy For You+Scars On 45&amp;tag=wt897fmrafomu-20" target="_blank">Buy Song</a> </span>

I'm using the code below to pull in html data and I want to remove a large chunk of the text to pull out the time/date (ex: 07/06 4:21 AM), artist (ex: Scars on 45), and song (ex: Crazy For You). I'm encountering either errors or the code not working as expected when I try either of the last three lines.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re
import numpy as np

html = urlopen("http://wtmd.org/radio/RecentSongs.html")
soup = BeautifulSoup(html.read())
Songs=soup.select('div.song')
#data=np.asarray(Songs)

df = pd.DataFrame({'col1':Songs})
df['col1']=df['col1'].apply(str)

#errors below

df['col1']=df['col1'].replace("<div class=\"song\">",",") #this does not get replaced

df['col1']=re.sub("<div class=\"song\">",",",df['col1'])  #this throws TypeError: expected string or buffer

df['col1']=re.sub("<(.*?)>",",",df['col1'])  #this throws TypeError: expected string or buffer

I've tried these methods both with and without using the

.apply(str)

method, but neither seem to work.

I've tried a few different ways of escaping the quotes in the replace function, (ie using """ and ' to define the find part). Any ideas or insights are greatly appreciated!

score 1 · Accepted Answer · edited May 23 '17 at 12:06

Don't try to parse the HTML with regex. Extract the data using BeautifulSoup, and then pump the data into a DataFrame:

from bs4 import BeautifulSoup
import pandas as pd

content = '''
<div class="song">
<p><span class="small">07/06 4:21 AM</span>     -     <span class="small">Title:</span>   Crazy For You   -     <span class="small">Artist:</span>  Scars On 45    
 <span class="small"><a href="http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Ddigital-music&amp;field-keywords=Crazy For You+Scars On 45&amp;tag=wt897fmrafomu-20" target="_blank">Buy Song</a> </span>
'''

soup = BeautifulSoup(content)

data = list()
for p in soup.select('div.song p'):
    row = list(p.stripped_strings)
    date = row[0]
    title = row[3].strip('- ')
    artist = row[5]
    data.append([date, title, artist])
df = pd.DataFrame(data, columns=['date', 'title', 'artist'])
df['date'] = pd.to_datetime(df['date'])
print(df)

yields

                 date          title       artist
0 2015-07-06 04:21:00  Crazy For You  Scars On 45

score 0 · Answer 2 · answered Jul 09 '15 at 20:13

This will do the substitution you're looking for:

df['col1'].str.replace('<div class="song">', '')

Your .replace() method is trying to find an exact string match to replace rather than a subsitution based on existing parameters: Pandas.DataFrame.replace(). The .str.replace() uses the built in string method in pandas which achieves what you were looking for: Pandas - Working with Text Data.

This would also work:

df['col1'].replace('<div class="song">', '',regex=True)

What HappyLeapSecond has provided though is the better approach to scraping the data.

@HappyLeapSecond - Thanks for the link!!!! I now understand that parsing HTML with regex is a gateway to Cthulu — Ben Porter, Jul 09 '15 at 21:15

How do I resolve error while removing parts of a string in Python DataFrame?

2 Answers2