
I was scraping a page in Danish, and I'm having trouble with the output: it contains garbled sequences like Ã¥ and Ã instead of the å and æ shown on the page.

How can I scrape the text just like on the page?

Example link: https://novaindex.com/dk/leverandoerer/mode-og-tekstiler/arbejdstoej

import scrapy
    
class MainSpider(scrapy.Spider):
    name = 'main'

    start_urls = ['https://novaindex.com/dk/leverandoerer/mode-og-tekstiler/arbejdstoej']

    def parse(self, response):
        # each company result is an <a class="companyresult "> element;
        # note the trailing space in the class value, copied from the page markup
        details = response.xpath('//a[@class="companyresult "]')

        for each in details:
            name = each.xpath('normalize-space(.//span[@class="name"]/text())').get()
            street = each.xpath('normalize-space(.//span[@class="street"]/text())').get()
            city = each.xpath('normalize-space(.//span[@class="city"]/text())').get()
            phone = each.xpath('normalize-space(.//span[@class="phone"]/text())').get()

            yield {
                "Name": name,
                "Street Address": street,
                "City Address": city,
                "Phone": phone,
            }
codewithawais
  • Which Python version do you use? I strongly suggest you switch to Python 3, which will solve most of your problems with Unicode symbols. – Michael Savchenko Jul 30 '20 at 10:57
  • I am using Python 3.7 – codewithawais Jul 30 '20 at 12:16
  • I tried it with Selenium with utf-8 and it didn't work, but when I removed the utf-8 encoding the output was the same as the website. So, do you know how I can ignore utf-8 encoding with Scrapy? – codewithawais Jul 30 '20 at 12:37
  • 3.7 is nice. And one more question: what's your output? I mean where do you see broken symbols? do you scrape to file/to screen/to database? – Michael Savchenko Jul 30 '20 at 13:59
  • In general Scrapy gets the page content correctly 99.99999999% of the time. The problem in your case is showing this content correctly. I believe if you add some print(name) statements inside your code you'll see the correct letters. – Michael Savchenko Jul 30 '20 at 14:03
  • I think the issue is with the exporting data to CSV file. I exported the data to JSON and it worked fine. – codewithawais Jul 30 '20 at 14:55
  • Hah... it's quite a well-known problem indeed. Which app do you use to open the csv file? You should specify UTF-8 encoding there to open the file correctly. I know Excel has some trouble with it. Unfortunately I can't guide you on this because I haven't used anything except Linux for years already. – Michael Savchenko Jul 30 '20 at 15:11
  • 1
    Yes, I used excel to open the file. Anyways, thanks a lot. I converted the JSON file to Excel and it worked for me. – codewithawais Jul 30 '20 at 15:19
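The CSV-vs-Excel problem the comments converge on can also be fixed on the Scrapy side: setting the built-in FEED_EXPORT_ENCODING option to 'utf-8-sig' makes the CSV feed start with a byte-order mark, which Excel uses to auto-detect UTF-8. A minimal standalone sketch of why the BOM matters (the company name is just sample data, not from the spider):

```python
import csv
import io

# In Scrapy's settings.py (or the spider's custom_settings) you would set:
#   FEED_EXPORT_ENCODING = "utf-8-sig"
# The demo below shows what that codec does: it prepends the UTF-8
# byte-order mark, which is the hint Excel needs to decode the file.
buf = io.BytesIO()
wrapper = io.TextIOWrapper(buf, encoding="utf-8-sig", newline="")
csv.writer(wrapper).writerow(["NORDJYSK DØGNGALVANISERING", "Aalborg"])
wrapper.flush()

raw = buf.getvalue()
print(raw[:3])  # b'\xef\xbb\xbf' -- the UTF-8 BOM
print(raw.decode("utf-8-sig"))
```

Opening such a file in Excel shows Ø and Å correctly, without converting anything through JSON first.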

2 Answers


You could add `.encode('utf8')` after `get()` or `getall()`.

Scrapy extracts data as Unicode strings; this may help you understand a bit about Unicode and UTF-8:

What is a unicode string?
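The specific garbage in the question (Ã¥ where the page shows å) is the classic signature of UTF-8 bytes being decoded as Latin-1, which means the scraped strings themselves are usually fine and only the final write/read step is wrong. A small sketch of that round trip:

```python
# 'å' encodes to the two UTF-8 bytes 0xC3 0xA5; read back as Latin-1,
# those two bytes display as 'Ã' followed by '¥' -- the mojibake from
# the question. Reversing the mistaken decode repairs the text.
mojibake = "Ã¥"
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)  # å

# the forward direction produces exactly the garbage seen in the output
assert "å".encode("utf-8").decode("latin-1") == mojibake
```

So rather than re-encoding the extracted strings, the fix is to make sure whatever consumes the output (file, terminal, Excel) reads it as UTF-8.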

AaronS
  • Unfortunately, this did not work. I tried it with Selenium with utf-8 and it didn't work, but when I removed the utf-8 encoding the output was the same as the website. So, do you know how I can ignore utf-8 encoding with Scrapy? – codewithawais Jul 30 '20 at 12:29

The Danish codec is cp865; check all available codecs here.

NB: Use ascii only if you're scraping an English website.

def string_cleaner(rouge_text):
    # strip whitespace, then drop any character cp865 cannot represent
    return ("".join(rouge_text.strip()).encode('cp865', 'ignore').decode("cp865"))

Use 'ignore' to skip characters that cannot be encoded instead of raising an error.

Usage

yield {
    "Name": string_cleaner(name),
    ...
}

For more explanation of what the code does, check my code breakdown here.
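To see the difference the codec choice makes for the Ø example raised in the comments (this is a sketch using the cleaner above on sample data; ascii drops Danish letters, while cp865, the DOS Nordic codepage, can represent them):

```python
def string_cleaner(rouge_text):
    # strip whitespace, then drop any character cp865 cannot represent
    return "".join(rouge_text.strip()).encode("cp865", "ignore").decode("cp865")

name = "NORDJYSK DØGNGALVANISERING AKTIESELSKAB"

# ascii + 'ignore' silently drops every non-English letter:
print(name.encode("ascii", "ignore").decode("ascii"))
# -> NORDJYSK DGNGALVANISERING AKTIESELSKAB

# cp865 covers Danish (Ø, Å, Æ, ...), so nothing is lost here:
print(string_cleaner(name))
# -> NORDJYSK DØGNGALVANISERING AKTIESELSKAB
```

Note that cp865 is still a legacy single-byte codepage, so any character outside it would be dropped the same way ascii drops Ø.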

xaander1
  • This method completely skips the letters: NORDJYSK DØGNGALVANISERING AKTIESELSKAB became NORDJYSK DGNGALVANISERING AKTIESELSKAB; it skipped the letter Ø. – codewithawais Jul 30 '20 at 12:32
  • Use the Danish codec `cp865` instead of `ascii`, which is English-only. Updated the answer. – xaander1 Jul 30 '20 at 18:34