import requests
from bs4 import BeautifulSoup
import re

source_url = requests.get('http://www.nytimes.com/pages/business/index.html')
div_classes = {'class' :['ledeStory' , 'story']}
title_tags = ['h2','h3','h4','h5','h6']

source_text = source_url.text
soup = BeautifulSoup(source_text, 'html.parser')


stories = soup.find_all("div", div_classes)

h = []; h2 = []; h3 = []; h4 =[]

for x in range(len(stories)):

    for x2 in range(len(title_tags)):
        hold = []; hold2 = []
        hold = stories[x].find(title_tags[x2])

        if hold is not None:
            hold2 = hold.find('a')

            if hold2 is not None:
                hh = (((hold.text.strip('a'))).strip())
                h.append(hh)
                #h.append(re.sub(r'[^\x00-\x7f]',r'', ((hold.text.strip('a'))).strip()))
                #h2.append(hold2.get('href'))

    hold = []
    hold = stories[x].find('p')

    if hold is not None:
        h3.append(re.sub(r'[^\x00-\x7f]',r'',((hold.text.strip('p')).strip())))

    else:
        h3.append('None')


h4.append(h)
h4.append(h2)
h4.append(h3)
print(h4)

Hey everyone. I have been wanting to scrape some data, and I had almost completed my scraper when I noticed that the printed output was replacing (') with (â\x80\x99). For example, a title containing "China's" was coming out as "Chinaâ\x80\x99s". I did some research and tried decode/encode with utf-8, to no avail: Python just told me that you cannot call decode on a str. I also tried re.sub(), which let me delete (â\x80\x99) but would not let me replace it with ('). Since I want to use natural language processing to interpret the data, I fear that missing apostrophes will greatly change the meaning. Help would be greatly appreciated; I feel like I have hit a block with this one.

muraaby

2 Answers


In ISO 8859-1 and related code sets (there are many of them), â has code point 0xE2. When you interpret the three bytes 0xE2, 0x80, 0x99 as a UTF-8 encoding, the character is U+2019, RIGHT SINGLE QUOTATION MARK (which is ’, as distinct from the ASCII apostrophe '; you may or may not be able to spot the difference).
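To make the byte-level explanation concrete, here is a small sketch that decodes the same three bytes two ways. The variable names are made up for illustration, and the sample word is taken from the question.

```python
# The UTF-8 encoding of "China" + U+2019 + "s" (bytes E2 80 99 for the quote).
raw = b"China\xe2\x80\x99s"

# Decoded as UTF-8, the three bytes become one character, U+2019.
as_utf8 = raw.decode("utf-8")

# Decoded as ISO 8859-1 (Latin-1), each byte becomes its own character:
# 0xE2 is "â", while 0x80 and 0x99 are invisible C1 control characters.
as_latin1 = raw.decode("latin-1")

print(repr(as_utf8))
print(repr(as_latin1))
```

The second result has the same shape as the output shown in the question, which suggests the page's UTF-8 bytes were decoded as Latin-1 somewhere along the way.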

I see a few possible sources of your trouble, and more than one may apply:

  1. Your terminal is not set up to interpret UTF-8.
  2. Your source code should use ' (U+0027, APOSTROPHE).
  3. You're using Python 2.x rather than Python 3.x and it is having issues because of the use of Unicode (UTF-8). Against this (as Cory Madden pointed out), the code ends with print(h4) which is Python 3, so it probably isn't the issue.

It may be simplest to change the quotation mark into an ASCII apostrophe.

On the other hand, if you are analyzing HTML from elsewhere, you may have to consider how your script is going to handle UTF-8. Using quote marks from the Unicode U+20xx range is a very common choice; maybe your scraper needs to handle it?
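If you do want to change the quotation mark into an ASCII apostrophe, one possible approach is a translation table. This is only a sketch: `QUOTE_MAP` and `asciify_quotes` are hypothetical names, the set of characters mapped is an assumption about what the scraped pages contain, and it only helps once the text has been decoded correctly (so that the quote really is U+2019 rather than the three-character â sequence).

```python
# Hypothetical mapping from common "curly" quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})


def asciify_quotes(text):
    """Return text with curly quotes replaced by ASCII quotes."""
    return text.translate(QUOTE_MAP)


print(asciify_quotes("China\u2019s exports"))
```

Whether to normalize like this or keep U+2019 as-is depends on what the downstream language-processing tools expect.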

Jonathan Leffler
  • The last line of the code example implies it's Python 3.x – Cory Madden Aug 07 '17 at 03:50
  • @CoryMadden: yes, you're right. So option 3 is probably (almost certainly) not relevant. – Jonathan Leffler Aug 07 '17 at 03:52
  • Oops, I misread it as ALL of those criteria were the problem. – Cory Madden Aug 07 '17 at 03:53
  • I will look into how to set up my terminal to interpret UTF-8. Could you explain more about the source code using ' ? I'm not really sure how to correct that. – muraaby Aug 07 '17 at 05:14
  • "It may be simplest to change the quotation mark into an ASCII apostrophe": do you mind showing me how I could do this? – muraaby Aug 07 '17 at 05:17
  • I don't see where in the code the problem could be, but you're scraping HTML, so I'm now guessing that the scraped web pages contain UTF-8 and somewhere in your system the UTF-8 is not being handled correctly. At that point, you start moving out of my area of expertise. It's borderline: I can recognize and interpret byte sequences, but working out the fix is trickier. – Jonathan Leffler Aug 07 '17 at 05:20
  • Thank you for your help, Jonathan. It brings me one step closer. I'll try to tweak it tomorrow and see if I can find a way to have it handled correctly. – muraaby Aug 07 '17 at 05:25

I have come across the same problem while scraping data with requests, then parsing it with BeautifulSoup.

This solution from here works well for me:

soup = BeautifulSoup(r.content.decode('utf-8'),"lxml")

If this doesn't work, adding .encode('latin1').decode('utf-8') after the .get_text() or .text also solves the issue.
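The second workaround can be wrapped in a small helper so that it is safe to call on text that was already decoded correctly. This is a sketch under the assumption that the damage came from UTF-8 bytes being decoded as Latin-1; `fix_mojibake` is a made-up name, not part of requests or BeautifulSoup.

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return text unchanged otherwise."""
    try:
        # Re-encoding as Latin-1 recovers the original bytes, which are then
        # decoded with the encoding they were actually written in.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text holds characters outside Latin-1 (so it was likely
        # decoded correctly already) or the bytes are not valid UTF-8.
        return text


broken = "China\u00e2\x80\x99s"   # what the question's output looked like
print(fix_mojibake(broken))
```

For the question's code, this would be applied to values such as `hold.text` before appending them. Decoding the response bytes as UTF-8 up front, as in the first suggestion, avoids the need for a repair step at all.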

Ayşe Nur