I'm trying to extract text from parent comments on the website songmeanings.com using BeautifulSoup from the following HTML:

<div class="text" id="comment-73014911864">
 <strong class="title">
  General Comment
 </strong>
 This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,
 <br/>
 <br/>
 (a) His songs have meaning. They're not about sex and cars and bling blingin' rims.
 <br/>
 (b) He has talent. He can actually rap. I don't think d12 is any good. =/
 <br/>
 <br/>
 Anyway. I love this song and I'm getting his new CD right now... hehe.
 <br/>
 -Sarah
 <div class="sign">
  <a class="author" href="/profiles/view/17067478/" id="userprofile-17067478" rel="me nofollow" title="xoDonnieDarko">
   xoDonnieDarko
  </a>
  <em class="date">
   on December 06, 2005
  </em>
  <a href="/songs/view/3530822107858560012/?&amp;specific_com=73014911864#comments" id="specific_com-73014911864" rel="nofollow" title="Permalink">
   Link
  </a>
 </div>
 <ul class="answers">
  <li>
   <div class="title">
    <a class="replies close-replies" href="#" id="showreplies-73014911864" rel="nofollow" title="3 Replies">
     3 Replies
    </a>
    <span class="login">
     <a class="lightbox" href="#popup-loginform" rel="nofollow">
      Log in to reply
     </a>
    </span>
    <br>
    </br>
   </div>
   <div id="formreply-73014911864" style="display: none;">
    <!-- comment-form -->
    <form action="#" class="comment-form-reply" id="comment-form-reply-73014911864">
     <div class="area" id="reply-errors-box" style="display: none;">
      <label for="type">
      </label>
      <span id="reply-errors" style="color: #ff0000;">
       There was an error.
      </span>
     </div>
     <div class="area">
      <div class="textarea">
       <div class="holder">
        <div class="frame">
         <textarea class="frmreplycomment-73014911864" id="frmreplycomment" name="frmreplycomment">
          @xoDonnieDarko
         </textarea>
        </div>
       </div>
      </div>
     </div>
     <input id="frmreplylid" name="frmreplylid" type="hidden" value="3530822107858560012">
      <input id="frmaid" name="frmaid" type="hidden" value="94">
       <input id="frmreplycid" name="frmreplycid" type="hidden" value="73014911864">
        <input class="submit" type="submit" value="Add reply"/>
       </input>
      </input>
     </input>
    </form>
   </div>
   <div id="thesereplies-73014911864" style="display: none;">
    <div class="answer-holder" id="fullcomment-73015890665">
     <a name="comment-73015890665">
     </a>
     <div id="rating-holder-73015890665">
      <div class="numb-holder">
       <span id="com-rating-73015890665">
        <strong class="numb" id="numb-rating-73015890665">
         +1
        </strong>
       </span>
       <div class="com-whorated" id="com-whorated-73015890665" style="display: none; text-align: center;">
        <span class="processing">
        </span>
       </div>
       <div id="processing-73015890665" style="text-align: center; padding: 8px 8px 0px 12px; display: none;">
        <span class="processing">
        </span>
       </div>
      </div>
     </div>
     <div class="text">
      i agree he is the only rapper i can listen too.
      <div class="sign">
       <span id="flagspan-73015890665">
        <a class="flag" href="#" id="flag-73015890665">
         Flag
        </a>
       </span>
       <a class="author" href="/profiles/view/17374833/" id="userprofile-17374833" rel="me nofollow" title="byrdman1992">
        byrdman1992
       </a>
       <em class="date">
        on March 15, 2010
       </em>
      </div>
     </div>
    </div>
    <div class="answer-holder" id="fullcomment-73015961779">
     <a name="comment-73015961779">
     </a>
     <div id="rating-holder-73015961779">
      <div class="numb-holder">
       <span id="com-rating-73015961779">
        <strong class="numb" id="numb-rating-73015961779">
         0
        </strong>
       </span>
       <div class="com-whorated" id="com-whorated-73015961779" style="display: none; text-align: center;">
        <span class="processing">
        </span>
       </div>
       <div id="processing-73015961779" style="text-align: center; padding: 8px 8px 0px 12px; display: none;">
        <span class="processing">
        </span>
       </div>
      </div>
     </div>
     <div class="text">
      same her the ONLY one...and sometimes lil' wayne! lol
      <div class="sign">
       <span id="flagspan-73015961779">
        <a class="flag" href="#" id="flag-73015961779">
         Flag
        </a>
       </span>
       <a class="author" href="/profiles/view/17418133/" id="userprofile-17418133" rel="me nofollow" title="dancer017">
        dancer017
       </a>
       <em class="date">
        on August 26, 2010
       </em>
      </div>
     </div>
    </div>
    <div class="answer-holder" id="fullcomment-73016306033">
     <a name="comment-73016306033">
     </a>
     <div id="rating-holder-73016306033">
      <div class="numb-holder">
       <span id="com-rating-73016306033">
        <strong class="numb" id="numb-rating-73016306033">
         0
        </strong>
       </span>
       <div class="com-whorated" id="com-whorated-73016306033" style="display: none; text-align: center;">
        <span class="processing">
        </span>
       </div>
       <div id="processing-73016306033" style="text-align: center; padding: 8px 8px 0px 12px; display: none;">
        <span class="processing">
        </span>
       </div>
      </div>
     </div>
     <div class="text">
      <a href="/profiles/view/17067478/?mention=12eeb84af5d911243541dc3bf651fc7b" id="userprofile-17067478" rel="me nofollow" title="@xoDonnieDarko">
       @xoDonnieDarko
      </a>
      RIttz is pretty good.. Can listen to yela and tech too.
      <div class="sign">
       <span id="flagspan-73016306033">
        <a class="flag" href="#" id="flag-73016306033">
         Flag
        </a>
       </span>
       <a class="author" href="/profiles/view/17643918/" id="userprofile-17643918" rel="me nofollow" title="Heeltoehole">
        Heeltoehole
       </a>
       <em class="date">
        on September 05, 2015
       </em>
      </div>
     </div>
    </div>
   </div>
  </li>
 </ul>
</div>

<div class="text">
 i agree he is the only rapper i can listen too.
 <div class="sign">
  <span id="flagspan-73015890665">
   <a class="flag" href="#" id="flag-73015890665">
    Flag
   </a>
  </span>
  <a class="author" href="/profiles/view/17374833/" id="userprofile-17374833" rel="me nofollow" title="byrdman1992">
   byrdman1992
  </a>
  <em class="date">
   on March 15, 2010
  </em>
 </div>
</div>

<div class="text">
 same her the ONLY one...and sometimes lil' wayne! lol
 <div class="sign">
  <span id="flagspan-73015961779">
   <a class="flag" href="#" id="flag-73015961779">
    Flag
   </a>
  </span>
  <a class="author" href="/profiles/view/17418133/" id="userprofile-17418133" rel="me nofollow" title="dancer017">
   dancer017
  </a>
  <em class="date">
   on August 26, 2010
  </em>
 </div>
</div>

<div class="text">
 <a href="/profiles/view/17067478/?mention=12eeb84af5d911243541dc3bf651fc7b" id="userprofile-17067478" rel="me nofollow" title="@xoDonnieDarko">
  @xoDonnieDarko
 </a>
 RIttz is pretty good.. Can listen to yela and tech too.
 <div class="sign">
  <span id="flagspan-73016306033">
   <a class="flag" href="#" id="flag-73016306033">
    Flag
   </a>
  </span>
  <a class="author" href="/profiles/view/17643918/" id="userprofile-17643918" rel="me nofollow" title="Heeltoehole">
   Heeltoehole
  </a>
  <em class="date">
   on September 05, 2015
  </em>
 </div>
</div>

With the code below I can extract most of the comment text, but any comment containing line breaks (<br/> tags) is cut off after its first line:

import urllib2
from bs4 import BeautifulSoup

url = "http://songmeanings.com/songs/view/3530822107858560012/"
response = urllib2.build_opener(urllib2.HTTPCookieProcessor).open(url)
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

for strong_tag in soup.find_all('strong'):
    print strong_tag.next_sibling

Which gives the output:

This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,

What I want is:

This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,

(a) His songs have meaning. They're not about sex and cars and bling blingin' rims.
(b) He has talent. He can actually rap. I don't think d12 is any good. =/

Anyway. I love this song and I'm getting his new CD right now... hehe.
-Sarah

How can I extract all text from a parent comment? Is there a better way to do this than using the strong tag?

Otis Cheng
  • I'm confused: is it an HTML comment or body text? If it is an HTML comment, the Comment class from bs4 can be used, but if it is body text it will be more complicated. – Shashank May 03 '17 at 16:24
  • I edited my post for clarification, I am trying to extract the text from each comment on that webpage. – Otis Cheng May 03 '17 at 17:05
  • After some research I found an answer to a similar question. Have a look at the accepted answer; it should work for you, except that you need to take care of a few '\n' and '\t' special characters in the output. Link: http://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text – Shashank May 03 '17 at 17:56
  • The website has other forms of text on it as well such as the song lyrics and headers and such. I want to extract solely the text from the comments on that page. Would I still be able to achieve that goal with what you linked? – Otis Cheng May 03 '17 at 18:13
  • I guess you will have to give it a shot, because that's the only solution I came up with if we are dealing with body text only. You might still have to adapt it a lot. – Shashank May 03 '17 at 19:56

1 Answer

I slightly modified https://stackoverflow.com/a/11809215/42346 (give him an upvote!) to get this solution:

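def loop_until(text, first_elem):
    """Accumulate text nodes starting at first_elem until the next <div>."""
    try:
        text += first_elem.string
    except TypeError:
        # .string is None for a tag with several children; keep what we have
        return text
    # Stop once the element that follows is the next <div> (the signature block)
    if first_elem.next_element == first_elem.find_next('div'):
        return text
    # Skip the <br/> that follows and continue with the text node after it
    return loop_until(text, first_elem.next_element.next_element)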

Call it like this:

# Start at the text node that directly follows the first <strong> title
next_elem = soup.find_all('strong')[0].next_sibling
loop_until('', next_elem)

Result:

 u"\n This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,\n \n\n (a) His songs have meaning. They're not about sex and cars and bling blingin' rims.\n \n (b) He has talent. He can actually rap. I don't think d12 is any good. =/\n \n\n Anyway. I love this song and I'm getting his new CD right now... hehe.\n \n -Sarah\n "
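
As an alternative to walking from the strong tag, here is a minimal sketch that reads only the direct children of each parent comment. It assumes, based solely on the HTML shown in the question, that a parent comment is a div with class "text" whose id looks like comment-<digits>, and that replies are nested deeper inside ul.answers. The helper name parent_comment_text and the shortened sample markup are illustrative, not part of the site or of BeautifulSoup.

```python
import re
from bs4 import BeautifulSoup, Comment, NavigableString


def parent_comment_text(comment_div):
    """Return the text that sits directly inside one parent comment <div>."""
    parts = []
    for child in comment_div.children:
        if isinstance(child, NavigableString):
            # Bare text nodes are the comment body; skip HTML comments.
            if not isinstance(child, Comment):
                parts.append(child.strip())
        elif child.name == "br":
            # Each <br/> becomes one newline in the output.
            parts.append("\n")
        # Any other tag (strong.title, div.sign, ul.answers) is ignored,
        # which keeps the signature and the nested replies out of the text.
    return "".join(parts).strip()


# Shortened stand-in for the markup in the question (assumption: same shape).
sample = """
<div class="text" id="comment-73014911864">
 <strong class="title">General Comment</strong>
 This is a beautiful song. Because,
 <br/>
 <br/>
 (a) His songs have meaning.
 <br/>
 -Sarah
 <div class="sign"><a class="author" href="#">xoDonnieDarko</a></div>
 <ul class="answers"><li><div class="text">a reply</div></li></ul>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
# Reply divs also have class "text" but no "comment-..." id, so they are excluded.
parent_divs = soup.find_all("div", class_="text", id=re.compile(r"^comment-\d+$"))
results = [parent_comment_text(div) for div in parent_divs]
print("\n---\n".join(results))
```

Because only direct children are inspected, the author, date, "Link", and reply text never reach the result. If the real page uses a different id pattern for parent comments, the regular expression would need adjusting.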
mechanical_meat
  • I tried the code you provided and the output didn't look anything like yours; it was still riddled with other information that I didn't want. I updated my question with more comprehensive HTML for the comment that I'm trying to extract. – Otis Cheng May 03 '17 at 17:28