Alright, so this is the code for a web scraper I've built. Right now it scrapes everything that I've selected with soup. But when I view the source of the page, the data includes <br> tags, which are line breaks.

When I scrape and save everything to the file, the <br> tags are dropped, so all the data ends up on one line. I want a <br> after each piece of data written to the file, like this:

Data<br>Data<br>Data<br>Data<br>

And not:

DataDataDataDataData

Is there any way to modify my current code to do this? I think it's the g = item.text.encode('utf-8') line that removes the <br>. I would be happy if I could keep the <br> in the output, because then I can just regex it.

    try:
        t_data = soup.find_all("div", {"class": "blockrow restore"})
        for item in t_data:
            f = open('test.txt', 'w')
            g = item.text.encode('utf-8')
            f.write(g)
            f.close()
    finally:
        pass  # body of the finally block is not shown in the original post

Thanks.

Community
  • Could you post an abbreviated sample of the HTML you're scraping, showing the relationship between the `div`s you're searching for and the
    tags within them?
    – Jon Winsley Nov 28 '16 at 19:39
  • In other news, it looks like your `for` loop might be overwriting "test.txt" on each iteration. You probably want to open it for [a]ppend instead of [w]rite. – Jon Winsley Nov 28 '16 at 19:41
  • Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data
    Data The output becomes: DataDataDataDataDataDataDataDataData instead of: Data
    Data
    Data
    Data
    Data
    – alexanderjoe Nov 28 '16 at 19:43

1 Answer

If you just want to keep the line breaks that the <br> tags represent, you can replace each <br> tag in the item with a newline character before extracting its text:

for br in item.find_all("br"):
    br.replace_with("\n")
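
As a rough end-to-end sketch of the replacement above: the sample markup and the helper name text_with_newlines are made up for illustration, the built-in "html.parser" backend is assumed, and the code is written for Python 3 rather than the Python 2 used in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the scraped page.
html = '<div class="blockrow restore">Data<br>Data<br>Data</div>'
soup = BeautifulSoup(html, "html.parser")


def text_with_newlines(item):
    """Return the tag's text with every <br> turned into a newline.

    Note: this modifies the parsed tree in place.
    """
    for br in item.find_all("br"):
        br.replace_with("\n")
    return item.get_text()


# One string per matching <div>, with line breaks preserved.
blocks = [
    text_with_newlines(item)
    for item in soup.find_all("div", {"class": "blockrow restore"})
]
```

Collecting the strings first and writing them to the file in one go, with the file opened once outside the loop, would also avoid the overwriting problem mentioned in the comments.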

If you actually want to preserve the internal HTML of the tag, you can convert the BeautifulSoup item back to a string and write that instead (unicode(item) on Python 2, as here; str(item) on Python 3):

g = unicode(item)
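
A small sketch of what that conversion gives you, again on Python 3 with made-up sample markup and the built-in "html.parser" backend assumed. Tag.decode_contents() is shown as an option when only the markup inside the div is wanted; BeautifulSoup usually serializes the tag as <br/> rather than <br>, so a regex should allow for both forms.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the scraped page.
html = '<div class="blockrow restore">Data<br>Data<br>Data</div>'
soup = BeautifulSoup(html, "html.parser")
item = soup.find("div", {"class": "blockrow restore"})

outer_html = str(item)               # the <div> itself plus its contents
inner_html = item.decode_contents()  # only the markup inside the <div>
```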
Jon Winsley