I have written some code that helps me scrape websites. It has worked well on some sites, but I am currently running into an issue.

The collectData() function collects data from a site and appends it to dataList. From that list I then create a CSV file to export the data.
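For reference, the export step is roughly this (a simplified sketch; the file name and column header are just placeholders):

import csv

def exportCsv(dataList, path='output.csv'):
    # write one scraped price per row
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['kaufpreis'])
        for value in dataList:
            writer.writerow([value])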

The issue I am having right now is that the function appends strings containing multiple whitespace and \n characters to my list. The output looks like this (the excessive whitespace is not shown here):

dataList = ['\n 2.500.000 ']
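I can reproduce the same effect with a minimal snippet; the HTML below is only my guess at what those sites return around the price, but .text keeps all the whitespace inside the tag:

from bs4 import BeautifulSoup

html = '<h2>\n        2.500.000\n    </h2>'  # guessed markup
soup = BeautifulSoup(html, 'lxml')
print(repr(soup.find('h2').text))  # -> '\n        2.500.000\n    '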

Does anyone know what could cause this? As I mentioned, there are some websites where the function works fine.

Thank you!

from urllib.request import urlopen
from bs4 import BeautifulSoup


def scrape():
    dataList = []
    pageNr = range(0, 1)  # only the first results page for now

    for page in pageNr:
        pageUrl = 'https://www.example.com/site:{}'.format(page)
        print(pageUrl)

        def getUrl(pageUrl):
            # collect the href of every result link on the page
            openUrl = urlopen(pageUrl)
            soup = BeautifulSoup(openUrl, 'lxml')
            links = soup.find_all('a', class_="ellipsis")
            linkList = []
            for link in links:
                linkList.append(link.get('href'))
            return linkList

        anzList = getUrl(pageUrl)

        length = len(anzList)
        print(length)

        # turn the relative hrefs into absolute URLs
        anzLinks = []
        for i in range(length):
            anzLinks.append('https://www.example.com' + anzList[i])

        print(anzLinks)

        def collectData():
            # open every listing and take the price from its first h2
            for link in anzLinks:
                openAnz = urlopen(link)
                soup = BeautifulSoup(openAnz, 'lxml')
                try:
                    kaufpreisSuche = soup.find('h2')
                    kaufpreis = kaufpreisSuche.text
                    dataList.append(kaufpreis)
                    print(kaufpreis)
                except AttributeError:
                    # page without an h2
                    dataList.append(None)

        collectData()

    return dataList