I'm trying to extract the text of parent (top-level) comments on songmeanings.com using BeautifulSoup. The relevant HTML for one comment and its replies is:
<div class="text" id="comment-73014911864">
<strong class="title">
General Comment
</strong>
This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,
<br/>
<br/>
(a) His songs have meaning. They're not about sex and cars and bling blingin' rims.
<br/>
(b) He has talent. He can actually rap. I don't think d12 is any good. =/
<br/>
<br/>
Anyway. I love this song and I'm getting his new CD right now... hehe.
<br/>
-Sarah
<div class="sign">
<a class="author" href="/profiles/view/17067478/" id="userprofile-17067478" rel="me nofollow" title="xoDonnieDarko">
xoDonnieDarko
</a>
<em class="date">
on December 06, 2005
</em>
<a href="/songs/view/3530822107858560012/?&specific_com=73014911864#comments" id="specific_com-73014911864" rel="nofollow" title="Permalink">
Link
</a>
</div>
<ul class="answers">
<li>
<div class="title">
<a class="replies close-replies" href="#" id="showreplies-73014911864" rel="nofollow" title="3 Replies">
3 Replies
</a>
<span class="login">
<a class="lightbox" href="#popup-loginform" rel="nofollow">
Log in to reply
</a>
</span>
<br>
</br>
</div>
<div id="formreply-73014911864" style="display: none;">
<!-- comment-form -->
<form action="#" class="comment-form-reply" id="comment-form-reply-73014911864">
<div class="area" id="reply-errors-box" style="display: none;">
<label for="type">
</label>
<span id="reply-errors" style="color: #ff0000;">
There was an error.
</span>
</div>
<div class="area">
<div class="textarea">
<div class="holder">
<div class="frame">
<textarea class="frmreplycomment-73014911864" id="frmreplycomment" name="frmreplycomment">
@xoDonnieDarko
</textarea>
</div>
</div>
</div>
</div>
<input id="frmreplylid" name="frmreplylid" type="hidden" value="3530822107858560012">
<input id="frmaid" name="frmaid" type="hidden" value="94">
<input id="frmreplycid" name="frmreplycid" type="hidden" value="73014911864">
<input class="submit" type="submit" value="Add reply"/>
</input>
</input>
</input>
</form>
</div>
<div id="thesereplies-73014911864" style="display: none;">
<div class="answer-holder" id="fullcomment-73015890665">
<a name="comment-73015890665">
</a>
<div id="rating-holder-73015890665">
<div class="numb-holder">
<span id="com-rating-73015890665">
<strong class="numb" id="numb-rating-73015890665">
+1
</strong>
</span>
<div class="com-whorated" id="com-whorated-73015890665" style="display: none; text-align: center;">
<span class="processing">
</span>
</div>
<div id="processing-73015890665" style="text-align: center; padding: 8px 8px 0px 12px; display: none;">
<span class="processing">
</span>
</div>
</div>
</div>
<div class="text">
i agree he is the only rapper i can listen too.
<div class="sign">
<span id="flagspan-73015890665">
<a class="flag" href="#" id="flag-73015890665">
Flag
</a>
</span>
<a class="author" href="/profiles/view/17374833/" id="userprofile-17374833" rel="me nofollow" title="byrdman1992">
byrdman1992
</a>
<em class="date">
on March 15, 2010
</em>
</div>
</div>
</div>
<div class="answer-holder" id="fullcomment-73015961779">
<a name="comment-73015961779">
</a>
<div id="rating-holder-73015961779">
<div class="numb-holder">
<span id="com-rating-73015961779">
<strong class="numb" id="numb-rating-73015961779">
0
</strong>
</span>
<div class="com-whorated" id="com-whorated-73015961779" style="display: none; text-align: center;">
<span class="processing">
</span>
</div>
<div id="processing-73015961779" style="text-align: center; padding: 8px 8px 0px 12px; display: none;">
<span class="processing">
</span>
</div>
</div>
</div>
<div class="text">
same her the ONLY one...and sometimes lil' wayne! lol
<div class="sign">
<span id="flagspan-73015961779">
<a class="flag" href="#" id="flag-73015961779">
Flag
</a>
</span>
<a class="author" href="/profiles/view/17418133/" id="userprofile-17418133" rel="me nofollow" title="dancer017">
dancer017
</a>
<em class="date">
on August 26, 2010
</em>
</div>
</div>
</div>
<div class="answer-holder" id="fullcomment-73016306033">
<a name="comment-73016306033">
</a>
<div id="rating-holder-73016306033">
<div class="numb-holder">
<span id="com-rating-73016306033">
<strong class="numb" id="numb-rating-73016306033">
0
</strong>
</span>
<div class="com-whorated" id="com-whorated-73016306033" style="display: none; text-align: center;">
<span class="processing">
</span>
</div>
<div id="processing-73016306033" style="text-align: center; padding: 8px 8px 0px 12px; display: none;">
<span class="processing">
</span>
</div>
</div>
</div>
<div class="text">
<a href="/profiles/view/17067478/?mention=12eeb84af5d911243541dc3bf651fc7b" id="userprofile-17067478" rel="me nofollow" title="@xoDonnieDarko">
@xoDonnieDarko
</a>
RIttz is pretty good.. Can listen to yela and tech too.
<div class="sign">
<span id="flagspan-73016306033">
<a class="flag" href="#" id="flag-73016306033">
Flag
</a>
</span>
<a class="author" href="/profiles/view/17643918/" id="userprofile-17643918" rel="me nofollow" title="Heeltoehole">
Heeltoehole
</a>
<em class="date">
on September 05, 2015
</em>
</div>
</div>
</div>
</div>
</li>
</ul>
</div>
<div class="text">
i agree he is the only rapper i can listen too.
<div class="sign">
<span id="flagspan-73015890665">
<a class="flag" href="#" id="flag-73015890665">
Flag
</a>
</span>
<a class="author" href="/profiles/view/17374833/" id="userprofile-17374833" rel="me nofollow" title="byrdman1992">
byrdman1992
</a>
<em class="date">
on March 15, 2010
</em>
</div>
</div>
<div class="text">
same her the ONLY one...and sometimes lil' wayne! lol
<div class="sign">
<span id="flagspan-73015961779">
<a class="flag" href="#" id="flag-73015961779">
Flag
</a>
</span>
<a class="author" href="/profiles/view/17418133/" id="userprofile-17418133" rel="me nofollow" title="dancer017">
dancer017
</a>
<em class="date">
on August 26, 2010
</em>
</div>
</div>
<div class="text">
<a href="/profiles/view/17067478/?mention=12eeb84af5d911243541dc3bf651fc7b" id="userprofile-17067478" rel="me nofollow" title="@xoDonnieDarko">
@xoDonnieDarko
</a>
RIttz is pretty good.. Can listen to yela and tech too.
<div class="sign">
<span id="flagspan-73016306033">
<a class="flag" href="#" id="flag-73016306033">
Flag
</a>
</span>
<a class="author" href="/profiles/view/17643918/" id="userprofile-17643918" rel="me nofollow" title="Heeltoehole">
Heeltoehole
</a>
<em class="date">
on September 05, 2015
</em>
</div>
</div>
With the code below I can extract most of the comment text, but any comment that contains line breaks (<br/> tags) is cut off after its first line:
import urllib2
from bs4 import BeautifulSoup
url = "http://songmeanings.com/songs/view/3530822107858560012/"
response = urllib2.build_opener(urllib2.HTTPCookieProcessor).open(url)
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')
for strong_tag in soup.find_all('strong'):
    print strong_tag.next_sibling
Which gives the output:
This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,
What I want is:
This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,
(a) His songs have meaning. They're not about sex and cars and bling blingin' rims.
(b) He has talent. He can actually rap. I don't think d12 is any good. =/
Anyway. I love this song and I'm getting his new CD right now... hehe.
-Sarah
How can I extract all text from a parent comment? Is there a better way to do this than using the strong tag?
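
For reference, here is a minimal sketch of the kind of result I'm after, written against a shortened copy of the HTML above rather than the live page. It assumes that parent comments are the div.text elements whose id starts with "comment-" (replies in the sample have no id), and that the comment body ends at the first div.sign child. The helper name parent_comment_text and the SAMPLE_HTML string are made up for illustration, and the code uses Python 3 print syntax.

```python
import re
from bs4 import BeautifulSoup, Tag

# Shortened, hypothetical copy of the markup shown above.
SAMPLE_HTML = """<div class="text" id="comment-73014911864">
<strong class="title">General Comment</strong>
This is a beautiful song. Because,
<br/>
<br/>
(a) His songs have meaning.
<br/>
-Sarah
<div class="sign"><a class="author" href="#">xoDonnieDarko</a></div>
<ul class="answers"><li><div class="text">a reply<div class="sign"></div></div></li></ul>
</div>"""


def parent_comment_text(comment_div):
    """Collect the direct text of a comment div, one line per text chunk.

    Walks only the direct children, skips the <strong> title and <br/> tags,
    and stops at the signature block so replies are never reached.
    """
    lines = []
    for node in comment_div.children:
        if isinstance(node, Tag):
            classes = node.get("class") or []
            if node.name == "div" and "sign" in classes:
                break  # assumption: everything after the signature is not comment text
            if node.name in ("strong", "br"):
                continue
            # Other inline tags (for example a link) contribute their text.
            text = node.get_text(" ", strip=True)
        else:
            # Plain text between tags (a NavigableString).
            text = node.strip()
        if text:
            lines.append(text)
    return "\n".join(lines)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# Assumption: only parent comments carry an id of the form "comment-<digits>".
parents = soup.find_all("div", class_="text", id=re.compile(r"^comment-"))
for div in parents:
    print(parent_comment_text(div))
```

Iterating over the children instead of taking strong_tag.next_sibling is what picks up the text after each <br/>; next_sibling only returns the single text node that follows the title. I'm not sure whether the id and div.sign assumptions hold for every comment on the site.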