
I have read a URL with the following code:

import urllib2
from bs4 import BeautifulSoup
req = urllib2.Request(url, headers=hdr)
req2 = urllib2.urlopen(req)

content = req2.read()
soup = BeautifulSoup(content, "lxml")

I want to scrape a website with a structure like the one below:

 <div class='\"companyNameWrapper\"'>
\r\n
<div class='\"companyName\"'>
 ACP Holding Deutschland GmbH
</div>
\r\n

The problem is that, because of the slashes, commands like

soup.findAll("div", {"class": "companyName"})

do not work. I need to convert the soup to a str to use .replace('\\', ''), but then the type is string and soup.findAll (and similar bs4 commands) are no longer valid.
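For illustration, the attempt described above looks roughly like this (my own sketch, not code from the question):

# Converting the soup to a string makes the replacement possible...
cleaned = str(soup).replace('\\', '')
# ...but cleaned is now a plain str, so bs4 methods are no longer available:
# cleaned.findAll("div", {"class": "companyName"})  # AttributeError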

Does anyone have a suggestion?

Thanks

nakisa

3 Answers


Try the following:

content.replace("\r", "").replace("\t", "")
#All replace as you need
soup = BeautifulSoup(content, "lxml")
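For the escaped quotes in the question's markup, the same pre-parse cleanup could be extended like this (a sketch; the exact replacement strings are my assumption based on the sample HTML):

# Assumed extension of the same idea: strip the escaped quotes before
# parsing so the class attribute becomes a plain companyName.
content = content.replace('\\"', '')
soup = BeautifulSoup(content, "lxml")
for div in soup.findAll("div", {"class": "companyName"}):
    print(div.get_text(strip=True))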
Wonka

I would consider using a regex for this issue. For example, if you want to find elements whose class matches companyName, I would do this:

import re

elements = soup.findAll("div", {"class": re.compile("^companyName")})

This will give you a list of all the matches for that class pattern. You can then access them by indexing or by iterating over the list.

I hope this helps.
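A short usage sketch for the list above; the printing calls are my additions:

# Access the matches by index, or iterate over the whole list.
if elements:
    print(elements[0].get_text(strip=True))
for el in elements:
    print(el.get_text(strip=True))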

  • I saw **regex** and instantly thought of: https://stackoverflow.com/a/1732454/4022608. Using BS's regex handler is fine though :) – Baldrickk Jun 14 '17 at 14:44

Did you try it inside a for loop, like this?

print(item.contents[1].find_all("div", {"class": "companyName"})[0].text.replace('\\', ''))
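For context, a sketch of what the surrounding loop might look like; item, the wrapper class it iterates over, and the assumption that the escaped quotes have already been cleaned up are all mine:

# Assumed loop: walk the wrapper divs and print each company name.
for item in soup.find_all("div", {"class": "companyNameWrapper"}):
    names = item.find_all("div", {"class": "companyName"})
    if names:
        print(names[0].text.replace('\\', ''))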

Oguz