I have written some code that helps me scrape websites. It has worked well on some sites, but I am currently running into an issue.
The collectData() function collects data from a site and appends it to 'dataList'. From this dataList I can create a csv file to export the data.
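The export step is not part of the code below; it is roughly this (simplified, and the filename and column header here are made up):

import csv

# Simplified version of my export step; filename and header are placeholders
def exportCsv(dataList, path='output.csv'):
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['kaufpreis'])   # single column with the scraped price text
        for value in dataList:
            writer.writerow([value])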
The issue I am having right now is that the function appends multiple whitespace and \n characters to my list. The output looks like this (the excessive whitespace is not shown here):
dataList = ['\n 2.500.000 ']
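For reference, a tiny made-up HTML fragment (not the real page) gives me the same kind of value when I read .text, so I suspect the whitespace comes straight from the markup:

from bs4 import BeautifulSoup

# Made-up fragment, just to show where the whitespace could come from
html = '<h2>\n                2.500.000            </h2>'
soup = BeautifulSoup(html, 'lxml')
print(repr(soup.find('h2').text))   # '\n                2.500.000            '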
Does anyone know what could cause this? As I mentioned, there are some websites where the function works fine.
Thank you!
from urllib.request import urlopen
from bs4 import BeautifulSoup


def scrape():
    dataList = []
    pageNr = range(0, 1)

    for page in pageNr:
        # Build the URL of the result page to scrape
        pageUrl = 'https://www.example.com/site:{}'.format(page)
        print(pageUrl)

        def getUrl(pageUrl):
            # Collect the href of every <a class="ellipsis"> on the result page
            linkList = []
            openUrl = urlopen(pageUrl)
            soup = BeautifulSoup(openUrl, 'lxml')
            links = soup.find_all('a', class_="ellipsis")
            for link in links:
                linkNew = link.get('href')
                linkList.append(linkNew)
            #print(linkList)
            return linkList

        anzList = getUrl(pageUrl)
        length = len(anzList)
        print(length)

        # Turn the relative hrefs into absolute links
        anzLinks = []
        for i in range(length):
            anzLinks.append('https://www.example.com/' + anzList[i])
        print(anzLinks)

        def collectData():
            # Open every listing and grab the price text from its first <h2>
            for link in anzLinks:
                openAnz = urlopen(link)
                soup = BeautifulSoup(openAnz, 'lxml')
                try:
                    kaufpreisSuche = soup.find('h2')
                    kaufpreis = kaufpreisSuche.text
                    dataList.append(kaufpreis)
                    print(kaufpreis)
                except AttributeError:
                    # No <h2> found on the page
                    kaufpreis = None
                    dataList.append(kaufpreis)

        collectData()

    return dataList
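I assume I could just clean the entries afterwards with str.strip(), like below, but I would still like to understand why only some sites produce the extra whitespace in the first place:

# Cleaning I assume would work: strip leading/trailing whitespace and newlines,
# leaving None entries untouched (the example value is the one shown above)
dataList = ['\n                2.500.000            ']
cleaned = [entry.strip() if entry is not None else entry for entry in dataList]
print(cleaned)   # ['2.500.000']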