BeautifulSoup .text method returns text without separators (\n, \r etc)

Question

I'm trying to parse song lyrics from the biggest Russian lyrics site, http://amalgama-lab.com, and save the lyrics (translated and original) into the audio list of my Vkontakte account (sadly, amalgama doesn't have an API).

import urllib
from BeautifulSoup import BeautifulSoup
import vkontakte
vk = vkontakte.API(token=<SECRET_TOKEN>)
audios = vk.getAudios(count='2')
#{u'artist': u'The Beatles', u'url': u'http://cs4519.vkontakte.ru/u4665445/audio/4241af71a888.mp3', u'title': u'Yesterday', u'lyrics_id': u'2365986', u'duration': 130, u'aid': 166194990, u'owner_id': 173505924}
base_url = 'http://amalgama.mobi/songs/'
for i in audios:
    print i['artist']
    # Build each song URL from base_url; appending to one shared string would corrupt it from the second song on.
    if i['artist'].startswith('The '):
        url = base_url + i['artist'][4:5] + '/' + i['artist'][4:].replace(' ', '_') + '/' + i['title'].replace(' ', '_') + '.html'
    else:
        url = base_url + i['artist'][:1] + '/' + i['artist'].replace(' ', '_') + '/' + i['title'].replace(' ', '_') + '.html'
    url = url.lower()
    page = urllib.urlopen(url)
    soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
    texts = soup.findAll('ol')
    # Two <ol> blocks are needed: the original and the translation.
    if len(texts) >= 2:
        en = texts[0].text  # this!
        ru = texts[1].text  # this!
        vk.get('audio.edit', aid=i['aid'], oid=i['owner_id'], artist=i['artist'], title=i['title'], text=ru, no_search=0)

But .text returns the string without any line separators:

"Yesterday, all my troubles seemed so far awayNow it look as though they're here to stayOh, I believe in yesterdaySuddenly, I'm not half the man I used to beThere's a shadow hanging over meOh, yesterday came suddenly[Chorus:]Why she had to go I don't know, she wouldn't sayI said something wrong, now I long for yesterdayYesterday, love was such an easy game to playNow I need a place to hide awayOh, I believe in"

That is the main problem. Next, what is the best way to save the lyrics in the following layout?

Lyrics line 1 (Original)

Lyrics line 1 (Translated)

Lyrics line 2 (Original)

Lyrics line 2 (Translated)

Lyrics line 3 (Original)

Lyrics line 3 (Translated)

...

So far I only get messy code. Thanks.

Example: http://amalgama.mobi/songs/b/beatles/yesterday.html – Martijn Pieters, Aug 25 '12 at 17:40
Note that there *are* no newlines in the songtext, only `<br>` tags, which the OP is stripping out. – Martijn Pieters, Aug 25 '12 at 17:40
I know :) What is a better way to convert HTML to text? Of course, I can replace `<br>` with '\n' and remove all the other tags myself, but that would look ... dirty. – just so, Aug 25 '12 at 17:58

score 31 · Answer 1 · edited Sep 08 '17 at 16:52

Try the separator parameter of the get_text method:

from bs4 import BeautifulSoup
html = '''<p> Hi. This is a simple example.<br>Yet powerful one. <p>'''
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')
soup.get_text()
# Output: ' Hi. This is a simple example.Yet powerful one. '
soup.get_text(separator=' ')
# Output: ' Hi. This is a simple example. Yet powerful one. '
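
For the lyrics case in the question, the same parameter can take a newline instead of a space. The following is only a sketch: it assumes bs4 is installed, and the markup is made up to resemble one `<ol>` block of the lyrics page rather than copied from it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <ol> block whose lines are separated by <br> tags.
html = "<ol><p>Lyrics line 1<br>Lyrics line 2<br>Lyrics line 3</p></ol>"

soup = BeautifulSoup(html, "html.parser")
ol = soup.find("ol")

# separator is placed between adjacent text nodes;
# strip=True trims each node and skips whitespace-only ones.
lyrics = ol.get_text(separator="\n", strip=True)
print(lyrics)
```

Each `<br>` splits the text into separate nodes, so joining the nodes with "\n" restores one lyric line per output line.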

edited Sep 08 '17 at 16:52 by Florian Brucker

answered Nov 02 '16 at 15:14 by Bishwas Mishra

Thanks, it does the trick. I use it to get the text of a webpage, then I use `re.sub(r"(\n( ?))+", "\n", my_text)` to remove multiple carriage returns, and `re.sub(r" +", " ", my_text)` to remove multiple spaces. – sodimel Nov 06 '19 at 08:49
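
The two substitutions from this comment can be tried on their own with the standard library. A small sketch follows; the sample string is invented for illustration.

```python
import re

# Invented sample: blank lines, space-only lines, and repeated spaces.
my_text = "Line one\n \n\nLine   two\n \nLine three"

# Collapse runs of newlines (each optionally followed by one space) into one newline.
my_text = re.sub(r"(\n( ?))+", "\n", my_text)
# Collapse runs of spaces into a single space.
my_text = re.sub(r" +", " ", my_text)

print(my_text)
```

Note that the first pattern only absorbs a single space after each newline, so lines indented by more than one space keep part of their leading whitespace.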

score 7 · Answer 2 · edited Jan 19 '21 at 16:56

I suggest you look into the .strings generator found in BeautifulSoup 4.
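
As a rough illustration of that suggestion, assuming bs4 is installed and using made-up markup in place of the real page, `.strings` can be consumed like this:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one block of lyrics.
html = "<ol><p>Lyrics line 1<br>Lyrics line 2<br>Lyrics line 3</p></ol>"
soup = BeautifulSoup(html, "html.parser")

# .strings yields every text node in document order; the <br> tags
# hold no text, so they only mark where one node ends and the next begins.
lines = [str(s) for s in soup.ol.strings]
text = "\n".join(lines)
print(text)
```

Because the generator hands back the pieces separately, the caller decides how to join them, here with a newline.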

edited Jan 19 '21 at 16:56 by MendelG

answered Aug 26 '12 at 03:18 by Leonard Richardson

In addition, take a look at `stripped_strings`. To iterate over the generator you could write, for instance, `for string in soup.stripped_strings:`. – Sebastian Cardona Osorio Mar 09 '19 at 02:59
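
Building on `stripped_strings`, the interleaved layout asked about in the question (original line, then its translation) might be produced as sketched below. The markup is hypothetical, and the code assumes that both `<ol>` blocks contain the same number of lines in the same order, which may not hold for every song page.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: first <ol> is the original, second the translation.
html = """
<ol><p>Original line 1<br>
Original line 2</p></ol>
<ol><p>Translated line 1<br>
Translated line 2</p></ol>
"""
soup = BeautifulSoup(html, "html.parser")
original, translated = soup.find_all("ol")[:2]

# stripped_strings yields each text node with surrounding whitespace removed.
en_lines = list(original.stripped_strings)
ru_lines = list(translated.stripped_strings)

# zip pairs line N of the original with line N of the translation
# and stops at the shorter list if the counts differ.
interleaved = []
for en_line, ru_line in zip(en_lines, ru_lines):
    interleaved.extend([en_line, ru_line])

text = "\n".join(interleaved)
print(text)
```

If the two blocks can differ in length, `itertools.zip_longest` with a fill value would keep the unmatched lines instead of dropping them.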

score 0 · Answer 3 · edited Aug 25 '12 at 18:24

You can do this:

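soup = BeautifulSoup(html)  # html: the page source, fetched as in the question
ols = soup.findAll('ol')  # one <ol> per language

for ol in ols:
    ps = ol.findAll('p')
    for p in ps:
        for item in p.contents:
            # Skip <br> tags by tag name; how a <br> is rendered as a string varies between versions.
            if getattr(item, 'name', None) != 'br':
                print(item)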
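
In the same spirit as the loop above, with bs4 one could also turn every `<br>` into a real newline before extracting the text, which is the "replace `<br>` with '\n'" idea mentioned in the question's comments. This is only a sketch on invented markup; treating each `<p>` as a stanza is an assumption about the page layout.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <p> stanzas with <br>-separated lines.
html = (
    "<ol><p>Lyrics line 1<br/>Lyrics line 2</p>"
    "<p>Lyrics line 3<br/>Lyrics line 4</p></ol>"
)
soup = BeautifulSoup(html, "html.parser")

# Swap each <br> tag for a newline text node.
for br in soup.find_all("br"):
    br.replace_with("\n")

# Join the stanzas with a blank line so paragraph breaks survive as well.
lyrics = "\n\n".join(p.get_text() for p in soup.ol.find_all("p"))
print(lyrics)
```

Unlike printing the children one by one, this keeps the result as a single string that can be passed on, for example as the lyrics text to upload.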

edited Aug 25 '12 at 18:24

answered Aug 25 '12 at 18:19 by Nasir
