
I am trying to scrape some data from a German-language website. The code is as follows:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib2, requests
import time
from selenium import webdriver
import os, sys

reload(sys)
sys.setdefaultencoding('utf-8')
chromedriver = "./chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
URL = 'http://de.vroniplag.wikia.com/'

def gethtml(link):

    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    html = con.read()
    return html


Soup = lambda x: BeautifulSoup(x, 'html.parser')

def getplagtags(url):

    soup = Soup(gethtml(url))
    frgtab = soup.find('table', attrs={'class': 'ueberpruefte-fragmentseiten'})
    frgs = [e['href'] for e in frgtab.findAll('a') if e['href'] and 'Seite nicht vorhanden' not in e['title']]

    lst=[]
    driver = webdriver.Chrome(chromedriver)
    print frgs
    for frg in frgs[0:1]:
        url=URL+frg[1:]
        print url
        driver.get(url)
        for tag in driver.find_elements_by_css_selector('[class^=fragmark]'):
            lst.append([tag.get_attribute('class'),tag.text.encode('utf-8')])
    driver.quit()
    print lst


getplagtags('http://de.vroniplag.wikia.com/wiki/Aaf')

The result is as follows:

[[u'fragmark1', 'Verursacher von Infektionen in Krankenh\xc3\xa4usern und'], [u'fragmark2', 'auch in der Bev\xc3\xb6lkerung.'], [u'fragmark3', 'zwei Jahre nach der'], [u'fragmark4', 'des semisynthetischen Penicillin Methicillin,'], [u'fragmark5', 'zur Behandlung von Penicillin-resistentem S. aureus'], [u'fragmark6', 'in einem Krankenhaus in Boston'], [u'fragmark7', 'im Jahr 2006 circa 19.000 Menschen an MRSA-Infektionen.'], [u'fragmark8', 'die Zahl der station\xc3\xa4r behandelten MRSA-Infektionen'], [u'fragmark9', 'zwischen 700 und 1.500 Personen an einer'], [u'fragmark1', 'Die Zahl der Infizierten'], [u'fragmark2', 'Mortalit\xc3\xa4t der Patienten durch schwerwiegende Erkrankungen wie'], [u'fragmark3', 'oder Staphylococcal Scaled Skin Syndrome.'], [u'fragmark4', 'Kosten f\xc3\xbcr das Gesundheitssystem.'], [u'fragmark5', 'in der gegenw\xc3\xa4rtigen Forschung'], [u'fragmark6', 'sind.'], [u'fragmark6', '1.1. Methicillin-resistenter'], [u'fragmark7', 'durch Resistenzen gegen\xc3\xbcber allen Betalaktamantibiotika'], [u'fragmark8', 'als minimale Oxacillin-Hemmkonzentration von \xe2\x89\xa5 4 \xce\xbcg/mL.'], [u'fragmark9', 'einem mobilen genetischen Element.'], [u'fragmark1', 'Durch den Repressor MecI und den Tranducer MecRi'], [u'fragmark2', 'bekannt. Das mecA-Gen kodiert f\xc3\xbcr ein'], [u'fragmark7', 'Jevons MP, Coe AW, Parker MT. Methicillin resistance in staphylococci. Lancet 1963; 1:904-907'], [u'fragmark3', 'Barber M. Methicillin resistant staphylococci. J Clin Path'], [u'fragmark4', 'Barrett FF, McGehee RF Jr, Finland M. Methicillin-resistant Staphylococcus aureus at Boston City Hospital. Bacteriologic and epidemiologic observations. N Engl J Med 1968; 279;441-448'], [u'fragmark1', 'Klevens et al.: Invasive Methicillin-Resistant Staphylococcus aureus Infections in the United States. JAMA 298/15/2007. S. 1763'], [u'fragmark8', 'Klein E, Smith DL, Laxmiranayan R. 
Hospitalizations and deaths caused by Methicillin-resistant Staphylococcus aureus, United States, 1999'], [u'fragmark9', 'Infect Dis 2007; 13(12):1840-1846'], [u'fragmark2', 'Noskin GA, Rubin RJ,'], [u'fragmark3', 'et al. The burden of Staphylococcus'], [u'fragmark4', 'on hospitals in the United States: an analysis of the 2000 and 2001 Nationwide'], [u'fragmark5', 'Sample Database. Arch Intern Med 2005; 165:1756-1761'], [u'fragmark5', 'Deurenberg RH, Stobberingh EE. The evolution of Staphylococcus aureus. Infect'], [u'fragmark6', 'Evol. 2008 Jul 29.'], [u'fragmark1', 'Verursacher von Infektionen in Krankenh\xc3\xa4usern und'], [u'fragmark2', 'auch in der Bev\xc3\xb6lkerung.'], [u'fragmark3', 'zwei Jahre nach der'], [u'fragmark4', 'des semisynthetischen Penicillin Methicillin'], [u'fragmark5', 'zur Behandlung von Penicillin-resistentem S. aureus'], [u'fragmark6', 'in einem Krankenhaus in Boston'], [u'fragmark7', 'im Jahr 2006 circa 19.000 Menschen an MRSA-Infektionen'], [u'fragmark8', 'die Zahl der station\xc3\xa4r behandelten MRSA-Infektionen'], [u'fragmark9', 'zwischen 700 und 1.500 Personen an einer'], [u'fragmark1', 'die Zahl der Infizierten'], [u'fragmark2', 'Mortalit\xc3\xa4t der Patienten durch schwerwiegende Erkrankungen wie'], [u'fragmark3', 'oder Staphylococcal Scaled Skin Syndrome'], [u'fragmark4', 'Kosten f\xc3\xbcr das Gesundheitssystem,'], [u'fragmark5', 'in der gegenw\xc3\xa4rtigen Forschung'], [u'fragmark6', 'sind.'], [u'fragmark6', '1.1 Methicillin-resistenter'], [u'fragmark7', 'durch Resistenzen gegen\xc3\xbcber allen Betalaktamantibiotika'], [u'fragmark8', 'als minimale Oxacillin-Hemmkonzentration von \xe2\x89\xa5 4 \xce\xbcg/mL.'], [u'fragmark9', 'einem mobilen genetischen Element,'], [u'fragmark1', 'durch den Repressor MecI und den Tranducer MecRi'], [u'fragmark2', 'bekannt. Das mecA-Gen kodiert f\xc3\xbcr ein'], [u'fragmark3', 'Barber M. Methicillin resistant staphylococci. J Clin Path'], [u'fragmark4', 'Barrett FF, McGehee RF Jr, Finland M. 
Methicillin-resistant Staphylococcus aureus at Boston City Hospital. Bacteriologic and epidemiologic observations. N Engl J Med 1968; 279;441-448'], [u'fragmark5', 'Deurenberg RH, Stobberingh EE. The evolution of Staphylococcus aureus. Infect'], [u'fragmark6', 'Evol. 2008 Jul 29'], [u'fragmark7', 'Jevons MP, Coe AW, Parker MT. Methicillin resistance in staphylococci. Lancet 1963; 1:904-907'], [u'fragmark8', 'Klein E, Smith DL, Laxmiranayan R. Hospitalizations and deaths caused by Methicillin-resistant Staphylococcus aureus, United States, 1999'], [u'fragmark9', 'Infect Dis 2007; 13(12):1840-1846'], [u'fragmark1', 'Klevens et al.: Invasive Methicillin-Resistant Staphylococcus aureus Infections in the United States. JAMA 298/15/2007. S. 1763'], [u'fragmark2', 'Noskin GA, Rubin RJ,'], [u'fragmark3', 'et al. The burden of Staphylococcus'], [u'fragmark4', 'on hospitals in the United States: an analysis of the 2000 and 2001 Nationwide'], [u'fragmark5', 'Sample Database. Arch Intern Med 2005; 165:1756- 1761']]

My question is: why is the text in the result (the second element of each list) not Unicode, even though I am using the encode function?

UPDATE: I removed the setdefaultencoding call and the encode call. Now I get the following result:

[[u'fragmark1', u'Verursacher von Infektionen in Krankenh\xe4usern und'], [u'fragmark2', u'auch in der Bev\xf6lkerung.'], [u'fragmark3', u'zwei Jahre nach der'], [u'fragmark4', u'des semisynthetischen Penicillin Methicillin,'], [u'fragmark5', u'zur Behandlung von Penicillin-resistentem S. aureus'], [u'fragmark6', u'in einem Krankenhaus in Boston'], [u'fragmark7', u'im Jahr 2006 circa 19.000 Menschen an MRSA-Infektionen.'], [u'fragmark8', u'die Zahl der station\xe4r behandelten MRSA-Infektionen'], [u'fragmark9', u'zwischen 700 und 1.500 Personen an einer'], [u'fragmark1', u'Die Zahl der Infizierten'], [u'fragmark2', u'Mortalit\xe4t der Patienten durch schwerwiegende Erkrankungen wie'], [u'fragmark3', u'oder Staphylococcal Scaled Skin Syndrome.'], [u'fragmark4', u'Kosten f\xfcr das Gesundheitssystem.'], [u'fragmark5', u'in der gegenw\xe4rtigen Forschung'], [u'fragmark6', u'sind.'], [u'fragmark6', u'1.1. Methicillin-resistenter'], [u'fragmark7', u'durch Resistenzen gegen\xfcber allen Betalaktamantibiotika'], [u'fragmark8', u'als minimale Oxacillin-Hemmkonzentration von \u2265 4 \u03bcg/mL.'], [u'fragmark9', u'einem mobilen genetischen Element.'], [u'fragmark1', u'Durch den Repressor MecI und den Tranducer MecRi'], [u'fragmark2', u'bekannt. Das mecA-Gen kodiert f\xfcr ein'], [u'fragmark7', u'Jevons MP, Coe AW, Parker MT. Methicillin resistance in staphylococci. Lancet 1963; 1:904-907'], [u'fragmark3', u'Barber M. Methicillin resistant staphylococci. J Clin Path'], [u'fragmark4', u'Barrett FF, McGehee RF Jr, Finland M. Methicillin-resistant Staphylococcus aureus at Boston City Hospital. Bacteriologic and epidemiologic observations. N Engl J Med 1968; 279;441-448'], [u'fragmark1', u'Klevens et al.: Invasive Methicillin-Resistant Staphylococcus aureus Infections in the United States. JAMA 298/15/2007. S. 1763'], [u'fragmark8', u'Klein E, Smith DL, Laxmiranayan R. 
Hospitalizations and deaths caused by Methicillin-resistant Staphylococcus aureus, United States, 1999'], [u'fragmark9', u'Infect Dis 2007; 13(12):1840-1846'], [u'fragmark2', u'Noskin GA, Rubin RJ,'], [u'fragmark3', u'et al. The burden of Staphylococcus'], [u'fragmark4', u'on hospitals in the United States: an analysis of the 2000 and 2001 Nationwide'], [u'fragmark5', u'Sample Database. Arch Intern Med 2005; 165:1756-1761'], [u'fragmark5', u'Deurenberg RH, Stobberingh EE. The evolution of Staphylococcus aureus. Infect'], [u'fragmark6', u'Evol. 2008 Jul 29.'], [u'fragmark1', u'Verursacher von Infektionen in Krankenh\xe4usern und'], [u'fragmark2', u'auch in der Bev\xf6lkerung.'], [u'fragmark3', u'zwei Jahre nach der'], [u'fragmark4', u'des semisynthetischen Penicillin Methicillin'], [u'fragmark5', u'zur Behandlung von Penicillin-resistentem S. aureus'], [u'fragmark6', u'in einem Krankenhaus in Boston'], [u'fragmark7', u'im Jahr 2006 circa 19.000 Menschen an MRSA-Infektionen'], [u'fragmark8', u'die Zahl der station\xe4r behandelten MRSA-Infektionen'], [u'fragmark9', u'zwischen 700 und 1.500 Personen an einer'], [u'fragmark1', u'die Zahl der Infizierten'], [u'fragmark2', u'Mortalit\xe4t der Patienten durch schwerwiegende Erkrankungen wie'], [u'fragmark3', u'oder Staphylococcal Scaled Skin Syndrome'], [u'fragmark4', u'Kosten f\xfcr das Gesundheitssystem,'], [u'fragmark5', u'in der gegenw\xe4rtigen Forschung'], [u'fragmark6', u'sind.'], [u'fragmark6', u'1.1 Methicillin-resistenter'], [u'fragmark7', u'durch Resistenzen gegen\xfcber allen Betalaktamantibiotika'], [u'fragmark8', u'als minimale Oxacillin-Hemmkonzentration von \u2265 4 \u03bcg/mL.'], [u'fragmark9', u'einem mobilen genetischen Element,'], [u'fragmark1', u'durch den Repressor MecI und den Tranducer MecRi'], [u'fragmark2', u'bekannt. Das mecA-Gen kodiert f\xfcr ein'], [u'fragmark3', u'Barber M. Methicillin resistant staphylococci. J Clin Path'], [u'fragmark4', u'Barrett FF, McGehee RF Jr, Finland M. 
Methicillin-resistant Staphylococcus aureus at Boston City Hospital. Bacteriologic and epidemiologic observations. N Engl J Med 1968; 279;441-448'], [u'fragmark5', u'Deurenberg RH, Stobberingh EE. The evolution of Staphylococcus aureus. Infect'], [u'fragmark6', u'Evol. 2008 Jul 29'], [u'fragmark7', u'Jevons MP, Coe AW, Parker MT. Methicillin resistance in staphylococci. Lancet 1963; 1:904-907'], [u'fragmark8', u'Klein E, Smith DL, Laxmiranayan R. Hospitalizations and deaths caused by Methicillin-resistant Staphylococcus aureus, United States, 1999'], [u'fragmark9', u'Infect Dis 2007; 13(12):1840-1846'], [u'fragmark1', u'Klevens et al.: Invasive Methicillin-Resistant Staphylococcus aureus Infections in the United States. JAMA 298/15/2007. S. 1763'], [u'fragmark2', u'Noskin GA, Rubin RJ,'], [u'fragmark3', u'et al. The burden of Staphylococcus'], [u'fragmark4', u'on hospitals in the United States: an analysis of the 2000 and 2001 Nationwide'], [u'fragmark5', u'Sample Database. Arch Intern Med 2005; 165:1756- 1761']]
Echchama Nayak
  • It's not unicode **because** you encoded it. – Stefan Pochmann Jul 02 '16 at 17:19
  • You successfully encoded to UTF-8, which is why that second element in each sublist does **not** start with a `u` and contains `\xhh` representations of bytes that fall outside of the ASCII range. `\xc3\xa4` is the *representation* for the C3 A4 bytes that are the UTF-8 encoding for the U+00E4 `ä` codepoint. – Martijn Pieters Jul 02 '16 at 17:29
  • @MartijnPieters So how can I convert those characters? – Echchama Nayak Jul 02 '16 at 17:40
  • @EchchamaNayak: what output did you *expect*? There is nothing to convert if you expected to have UTF-8 encoded data in your lists. You may have misunderstood how Python shows you what you have in your lists, but the actual contents are still correctly encoded UTF-8 bytestrings. – Martijn Pieters Jul 02 '16 at 17:48
  • @MartijnPieters As you said, the \xa4 is ä. So how do I get that in the final output? I need to perform an operation with this text. – Echchama Nayak Jul 02 '16 at 17:53
  • @EchchamaNayak: No, `\xa4` is **one** byte in a two-byte UTF-8 encoding for `ä`. Why are you encoding to UTF-8 in the first place? – Martijn Pieters Jul 02 '16 at 17:54
  • @MartijnPieters Sorry if I am not clear. My intention is to extract the German text as it is published on the website, yet I get these obfuscations. How do I get the ä and not the \x characters? – Echchama Nayak Jul 02 '16 at 17:56
  • @EchchamaNayak: you are printing a list object. The representation of any standard Python container is to use `repr()` on each element, and you are looking at that representation. If you must further manipulate the text, **don't encode**, keep the text as Unicode. Don't print the list objects, print individual values. You have the right data, you are just getting confused over the (debug) output you see. – Martijn Pieters Jul 02 '16 at 17:58
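
The point made in the comments can be sketched roughly as follows. This is only an illustration: the two rows are copied from the question's output, `format_rows` is a hypothetical helper name, and the code is written so that it should run on Python 2 and Python 3 alike. The escapes are notation used by `repr()`; the strings themselves hold the real characters.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Two rows copied from the question's output; the u'' prefix is valid on Python 2 and 3.
rows = [[u'fragmark1', u'Verursacher von Infektionen in Krankenh\xe4usern und'],
        [u'fragmark2', u'auch in der Bev\xf6lkerung.']]


def format_rows(rows):
    """Return one readable line of text per [class, text] pair (hypothetical helper)."""
    return [u'%s: %s' % (cls, text) for cls, text in rows]


lines = format_rows(rows)

# The escape is only notation: the string holds the single code point U+00E4.
umlaut_index = rows[0][1].index(u'\xe4')

if __name__ == '__main__':
    # Encoded bytes: their repr shows the two UTF-8 bytes as \xc3\xa4.
    print(repr(rows[0][1].encode('utf-8')))
    # Individual text values: a UTF-8 capable terminal shows the umlauts themselves.
    for line in lines:
        print(line)
```

Printing each value instead of the whole list avoids the `repr()` form entirely, which is what the last comment recommends.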

1 Answer


In Python 2 you decode from `str` (bytes) to `unicode` and encode from `unicode` to `str`. `tag.text.encode('utf-8')` therefore gives you exactly what you asked for: a UTF-8 encoded byte string. `tag.text` is already a unicode string, so use it directly:

 lst.append([tag.get_attribute('class'), tag.text])
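
To make the encode/decode direction concrete, here is a minimal round-trip sketch. The sample word is taken from the question's output and stands in for a value of `tag.text`; nothing here touches Selenium, and the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
text = u'Krankenh\xe4usern'          # Unicode text, standing in for tag.text

encoded = text.encode('utf-8')       # text -> bytes (a str on Python 2)
decoded = encoded.decode('utf-8')    # bytes -> text (a unicode on Python 2)

# The umlaut is one code point but two UTF-8 bytes (C3 A4), so the lengths differ.
char_count = len(text)
byte_count = len(encoded)
round_trip_ok = decoded == text
```

Because the encoded form is one byte longer than the text, slicing or counting on the encoded bytes would give different results than on the text, which is one reason to keep the values as Unicode while processing them.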

Also avoid `reload(sys)` and `sys.setdefaultencoding('utf-8')`; the question "Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?" explains the reasons.
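
An alternative to changing the default encoding is to decode explicitly wherever bytes enter the program. The sketch below assumes the incoming bytes are UTF-8 (an assumption for illustration; the real page encoding should be checked), and `raw` is a made-up stand-in for something like the result of `con.read()`.

```python
# -*- coding: utf-8 -*-
# Bytes as they might arrive from the network (assumed here to be UTF-8).
raw = b'Kosten f\xc3\xbcr das Gesundheitssystem.'

# Decode once, explicitly, with the codec the source actually uses.
text = raw.decode('utf-8')

# Decoding the same bytes as ASCII, the Python 2 default codec, fails loudly.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

With an explicit decode, a wrong assumption about the encoding surfaces at one known place, whereas a globally changed default can mask such mismatches.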

Padraic Cunningham