
I have read a URL with the following code:

import urllib2
from bs4 import BeautifulSoup
req = urllib2.Request(url, headers=hdr)
req2 = urllib2.urlopen(req)

content = req2.read()
soup = BeautifulSoup(content, "lxml")

I want to scrape a website with a structure like the one below:

 <div class='\"companyNameWrapper\"'>
\r\n
<div class='\"companyName\"'>
 ACP Holding Deutschland GmbH
</div>
\r\n

The problem is that, because of the slashes, commands like

soup.findAll("div", {"class": "companyName"})

do not work. I need to convert the soup to a str to use .replace('\\', ''), but then the type is string and soup.findAll (and similar bs4 commands) are no longer valid.
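For illustration, the attempt described above looks roughly like this (my own sketch, not code from the question):

# Converting the soup to a string makes the replacement possible...
cleaned = str(soup).replace('\\', '')
# ...but cleaned is now a plain str, so bs4 methods are no longer available:
# cleaned.findAll("div", {"class": "companyName"})  # AttributeError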

Does anyone have a suggestion?

Thanks

nakisa

3 Answers


Try the following:

content.replace("\r", "").replace("\t", "")
#All replace as you need
soup = BeautifulSoup(content, "lxml")
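For the escaped quotes in the question's markup, the same pre-parse cleanup could be extended like this (a sketch; the exact replacement strings are my assumption based on the sample HTML):

# Assumed extension of the same idea: strip the escaped quotes before
# parsing so the class attribute becomes a plain companyName.
content = content.replace('\\"', '')
soup = BeautifulSoup(content, "lxml")
for div in soup.findAll("div", {"class": "companyName"}):
    print(div.get_text(strip=True))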
Wonka

I would consider using a regex for this issue. For example, if you want to find elements whose class matches companyName, I would do this:

import re

elements = soup.findAll("div", {"class": re.compile("^companyName")})

This will give you a list of all the matches for that class pattern. You can then access them by indexing or by iterating over the list.

I hope this helps.
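A short usage sketch for the list above; the printing calls are my additions:

# Access the matches by index, or iterate over the whole list.
if elements:
    print(elements[0].get_text(strip=True))
for el in elements:
    print(el.get_text(strip=True))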

  • I saw **regex** and instantly thought of: https://stackoverflow.com/a/1732454/4022608. Using BS's regex handler is fine though :) – Baldrickk Jun 14 '17 at 14:44

Did you try it inside a for loop, like this?

print(item.contents[1].find_all("div", {"class": "companyName"})[0].text.replace('\\', ''))
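For context, a sketch of what the surrounding loop might look like; item, the wrapper class it iterates over, and the assumption that the escaped quotes have already been cleaned up are all mine:

# Assumed loop: walk the wrapper divs and print each company name.
for item in soup.find_all("div", {"class": "companyNameWrapper"}):
    names = item.find_all("div", {"class": "companyName"})
    if names:
        print(names[0].text.replace('\\', ''))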

Oguz