
I was scraping a page in Danish, and I'm having trouble with the output: it contains garbled sequences like Ã¥ and Ã instead of the å and æ shown on the page.

How can I scrape the text just like on the page?

Example link: https://novaindex.com/dk/leverandoerer/mode-og-tekstiler/arbejdstoej

import scrapy
    
class MainSpider(scrapy.Spider):
    name = 'main'

    start_urls = ['https://novaindex.com/dk/leverandoerer/mode-og-tekstiler/arbejdstoej']

    def parse(self, response):
        # each company result is an <a class="companyresult "> element;
        # note the trailing space in the class value, copied from the page markup
        details = response.xpath('//a[@class="companyresult "]')

        for each in details:
            name = each.xpath('normalize-space(.//span[@class="name"]/text())').get()
            street = each.xpath('normalize-space(.//span[@class="street"]/text())').get()
            city = each.xpath('normalize-space(.//span[@class="city"]/text())').get()
            phone = each.xpath('normalize-space(.//span[@class="phone"]/text())').get()

            yield {
                "Name": name,
                "Street Address": street,
                "City Address": city,
                "Phone": phone,
            }
codewithawais
  • Which Python version do you use? I strongly suggest you switch to Python 3, which will solve most of your problems with Unicode symbols. – Michael Savchenko Jul 30 '20 at 10:57
  • I am using Python 3.7 – codewithawais Jul 30 '20 at 12:16
  • I tried it with Selenium with utf-8 and it didn't work, but when I removed the utf-8 encoding the output was the same as the website. So, do you know how I can ignore utf-8 encoding with Scrapy? – codewithawais Jul 30 '20 at 12:37
  • 3.7 is nice. And one more question: what's your output? I mean where do you see broken symbols? do you scrape to file/to screen/to database? – Michael Savchenko Jul 30 '20 at 13:59
  • In general Scrapy gets the page content correctly 99.99999999% of the time. The problem in your case is showing this content correctly. I believe if you add some print(name) statements inside your code you'll see the correct letters. – Michael Savchenko Jul 30 '20 at 14:03
  • I think the issue is with the exporting data to CSV file. I exported the data to JSON and it worked fine. – codewithawais Jul 30 '20 at 14:55
  • Hah... it's quite a well-known problem indeed. Which app do you use to open the csv file? You should specify UTF-8 encoding there to open the file correctly. I know Excel has some trouble with it. Unfortunately I can't guide you on this because I haven't used anything except Linux for years already. – Michael Savchenko Jul 30 '20 at 15:11
  • 1
    Yes, I used excel to open the file. Anyways, thanks a lot. I converted the JSON file to Excel and it worked for me. – codewithawais Jul 30 '20 at 15:19
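The CSV-vs-Excel problem the comments converge on can also be fixed on the Scrapy side: setting the built-in FEED_EXPORT_ENCODING option to 'utf-8-sig' makes the CSV feed start with a byte-order mark, which Excel uses to auto-detect UTF-8. A minimal standalone sketch of why the BOM matters (the company name is just sample data, not from the spider):

```python
import csv
import io

# In Scrapy's settings.py (or the spider's custom_settings) you would set:
#   FEED_EXPORT_ENCODING = "utf-8-sig"
# The demo below shows what that codec does: it prepends the UTF-8
# byte-order mark, which is the hint Excel needs to decode the file.
buf = io.BytesIO()
wrapper = io.TextIOWrapper(buf, encoding="utf-8-sig", newline="")
csv.writer(wrapper).writerow(["NORDJYSK DØGNGALVANISERING", "Aalborg"])
wrapper.flush()

raw = buf.getvalue()
print(raw[:3])  # b'\xef\xbb\xbf' -- the UTF-8 BOM
print(raw.decode("utf-8-sig"))
```

Opening such a file in Excel shows Ø and Å correctly, without converting anything through JSON first.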

2 Answers


You could add `.encode('utf8')` after `get()` or `getall()`.

Scrapy extracts data as Unicode strings; this may help you understand a bit about Unicode and UTF-8:

What is a unicode string?
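The specific garbage in the question (Ã¥ where the page shows å) is the classic signature of UTF-8 bytes being decoded as Latin-1, which means the scraped strings themselves are usually fine and only the final write/read step is wrong. A small sketch of that round trip:

```python
# 'å' encodes to the two UTF-8 bytes 0xC3 0xA5; read back as Latin-1,
# those two bytes display as 'Ã' followed by '¥' -- the mojibake from
# the question. Reversing the mistaken decode repairs the text.
mojibake = "Ã¥"
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)  # å

# the forward direction produces exactly the garbage seen in the output
assert "å".encode("utf-8").decode("latin-1") == mojibake
```

So rather than re-encoding the extracted strings, the fix is to make sure whatever consumes the output (file, terminal, Excel) reads it as UTF-8.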

AaronS
  • Unfortunately, this did not work. I tried it with Selenium with utf-8 and it didn't work, but when I removed the utf-8 encoding the output was the same as the website. So, do you know how I can ignore utf-8 encoding with Scrapy? – codewithawais Jul 30 '20 at 12:29

The Danish codec is cp865; check all available codecs here.

NB: Use ascii only if you're scraping an English website.

def string_cleaner(rouge_text):
    # strip whitespace, then drop any character cp865 cannot represent
    return ("".join(rouge_text.strip()).encode('cp865', 'ignore').decode("cp865"))

Use 'ignore' to skip characters that cannot be encoded instead of raising an error.

Usage

yield {
    "Name": string_cleaner(name),
    ...
}

For more explanation of what the code does, check my code breakdown here.
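To see the difference the codec choice makes for the Ø example raised in the comments (this is a sketch using the cleaner above on sample data; ascii drops Danish letters, while cp865, the DOS Nordic codepage, can represent them):

```python
def string_cleaner(rouge_text):
    # strip whitespace, then drop any character cp865 cannot represent
    return "".join(rouge_text.strip()).encode("cp865", "ignore").decode("cp865")

name = "NORDJYSK DØGNGALVANISERING AKTIESELSKAB"

# ascii + 'ignore' silently drops every non-English letter:
print(name.encode("ascii", "ignore").decode("ascii"))
# -> NORDJYSK DGNGALVANISERING AKTIESELSKAB

# cp865 covers Danish (Ø, Å, Æ, ...), so nothing is lost here:
print(string_cleaner(name))
# -> NORDJYSK DØGNGALVANISERING AKTIESELSKAB
```

Note that cp865 is still a legacy single-byte codepage, so any character outside it would be dropped the same way ascii drops Ø.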

xaander1
  • This method completely skips the letters: NORDJYSK DØGNGALVANISERING AKTIESELSKAB became NORDJYSK DGNGALVANISERING AKTIESELSKAB; it skipped the letter Ø. – codewithawais Jul 30 '20 at 12:32
  • Use the Danish codec `cp865` instead of `ascii`, which is English-only. Updated the answer. – xaander1 Jul 30 '20 at 18:34