I am using Scrapy to crawl articles from a news website and store them in MongoDB. After inserting, however, the text in MongoDB contains Unicode escape sequences like this:

"article": "Satya Nadella, Microsoft\u2019s executive vice president of cloud and enterprise, has just been named the company\u2019s next CEO.

I have tried

FEED_EXPORT_ENCODING = "utf-8"

But it only worked when I ran the crawler and exported the data as a JSON file, not when storing the data in MongoDB.
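To illustrate what the `\u2019` escape is, here is a minimal standard-library sketch. It only shows how the same string looks when serialized with and without ASCII escaping; the assumption that the escape seen in MongoDB comes from a similar serialization step is a guess, not a confirmed diagnosis.

```python
import json

# U+2019 is RIGHT SINGLE QUOTATION MARK, the curly apostrophe in "Microsoft’s".
text = "Microsoft\u2019s"

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
escaped = json.dumps({"article": text})
# With ensure_ascii=False the character is written out as-is.
readable = json.dumps({"article": text}, ensure_ascii=False)

print(escaped)
print(readable)

# Both serialized forms decode back to the identical Python string.
assert json.loads(escaped) == json.loads(readable) == {"article": text}
```

In both cases the underlying string is the same; only its textual representation differs.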

In the spider.py file I wrote these lines to get the article text:

item["article"] = response.xpath('//p/text()').getall()

item["article"] = ' '.join(item['article'])

How can I replace these characters with their ASCII equivalents?

carl
  • `\u2019` has no ASCII equivalent; there is just `'`, which looks a bit similar. And actually: what is bad about Unicode? – Klaus D. May 03 '19 at 10:24
  • Possible duplicate of [Convert Unicode to ASCII without errors in Python](https://stackoverflow.com/questions/2365411/convert-unicode-to-ascii-without-errors-in-python) – bv_Martn May 03 '19 at 10:24
  • I want to show this text on my web page, but it's showing `\u2019` – carl May 03 '19 at 10:26
  • Then I guess the way you are showing it is not correct. Here in Stack Overflow it is very important to explain the original problem instead of describing the troubles you are having with your (maybe flawed) solution to it. – Klaus D. May 03 '19 at 10:29
  • @bv_Martn let me try that and see if it works – carl May 03 '19 at 10:30
  • @KlausD. there are other Unicode characters also being stored in MongoDB, like `\u201d`. I have tried `encode('ascii', 'ignore')` but now it's showing article: – carl May 03 '19 at 10:42
  • Why are you trying to convert them instead of fixing the display problem? The data is properly encoded and every browser can display Unicode. – Klaus D. May 03 '19 at 10:51
  • ok let me check the above text in my browser – carl May 03 '19 at 10:57
  • the browser is showing `\u2019` – carl May 03 '19 at 11:04
  • Show us enough code to reproduce how you are displaying the data. – Klaus D. May 03 '19 at 11:06
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/192771/discussion-between-carl-and-klaus-d). – carl May 03 '19 at 11:09
  • `a = unidecode.unidecode("Satya Nadella, Microsoft\u2019s executive vice president of cloud and enterprise, has just been named the company\u2019s next CEO.")` worked for me finally – carl May 03 '19 at 11:38

1 Answer

This solution worked for me (from the question "Character encoding in python to replace 'u2019' with '"):

import unidecode

# unidecode transliterates non-ASCII characters, so U+2019 becomes a plain apostrophe
a = unidecode.unidecode("Satya Nadella, Microsoft\u2019s executive vice president of cloud and enterprise, has just been named the company\u2019s next CEO.")
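If installing `unidecode` is not an option, a rough standard-library approximation is sketched below. This is not how `unidecode` works internally and covers far fewer characters; the `PUNCTUATION` table and the `to_ascii` helper are hypothetical names introduced only for illustration.

```python
import unicodedata

# Hypothetical mapping for common typographic punctuation; extend as needed.
PUNCTUATION = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # horizontal ellipsis
})

def to_ascii(text: str) -> str:
    """Best-effort ASCII version of text (illustrative helper, not unidecode)."""
    # Replace typographic punctuation first; plain encode('ascii', 'ignore')
    # would silently drop these characters instead of substituting them.
    text = text.translate(PUNCTUATION)
    # Split accented letters into base letter + combining mark,
    # then drop whatever is still outside ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Microsoft\u2019s \u201cnext\u201d CEO, caf\u00e9"))
```

Anything without an entry in the table or a decomposition to ASCII is dropped, which is why `unidecode` remains the more complete choice.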
carl
  • `unidecode` can be used as an output processor for text fields. See https://doc.scrapy.org/en/latest/topics/loaders.html#input-and-output-processors – Gallaecio May 03 '19 at 12:17