I'm writing a Python script, and this is the part I'm having trouble with: it simply extracts the post titles from a web page. Python does not seem to handle the accented characters, and I've tried everything I know: (1) putting # -*- coding: utf-8 -*- on the first line, and (2) calling .encode("utf-8") on each title.

code:

# -*- coding: utf-8 -*- 
import re
import requests

def opena(url):
    headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
    lexdan1 = requests.get(url,headers=headers)
    lexdan2 = lexdan1.text
    lexdan1.close()
    return lexdan2
dan = []
a = opena('http://www.megafilmesonlinehd.com/filmes-lancamentos')
d = re.compile('<strong class="tt-filme">(.+?)</strong>').findall(a)
for name in d:
    name =  name.encode("utf-8")
    dan.append(name)
print dan

This is what I got:

['Porta dos Fundos: Contrato Vital\xc3\xadcio HD 720p', 'Os 28 Homens de Panfilov Legendado HD', 'Estrelas Al\xc3\xa9m do Tempo Dublado', 'A Volta do Ju\xc3\xadzo Final Dublado Full HD 1080p', 'The Love Witch Legendado HD', 'Manchester \xc3\x80 Beira-Mar Legendado', 'Semana do P\xc3\xa2nico Dublado HD 720p', 'At\xc3\xa9 o \xc3\x9altimo Homem Legendado HD 720p', 'Arbor Demon Legendado HD 720p', 'Esquadr\xc3\xa3o de Elite Dublado Full HD 1080p', 'Ouija Origem do Mal Dublado Full HD 1080p', 'As Muitas Mulheres da Minha Vida Dublado HD 720p', 'Um Novo Desafio para Callan e sua Equipe Dublado Full HD 1080p', 'Terror Herdado Dublado DVDrip', 'Officer Downe Legendado HD', 'N\xc3\xa3o Bata Duas Vezes Legendado HD', 'Eu, Daniel Blake Legendado HD', 'Sangue Pela Gl\xc3\xb3ria Legendado', 'Quase 18 Legendado HD 720p', 'As Aventuras de Robinson Cruso\xc3\xa9 Dublado Full HD 1080p', 'Indigna\xc3\xa7\xc3\xa3o Dublado HD 720p']
harryscholes
sonaldynho
  • Have you tried printing the resulting string (literally place a `print` before it)? EDIT: just saw it... Try and print `dan[0]`. For instance, in 2.7 and after getting the same output as you, `>>> print dan[0]` got me `Porta dos Fundos: Contrato Vitalício HD 720p` and `>>> print dan[1]` `Os 28 Homens de Panfilov Legendado HD`. – berna1111 Feb 08 '17 at 22:32
  • Stack Overflow has Spanish and Portuguese sites, I believe. –  Feb 08 '17 at 23:04

2 Answers


Because you're telling the interpreter to print a list, the interpreter converts the list itself to a string by calling the list's __str__ method. A container's __str__ method uses the __repr__ method of each contained object (in this case, objects of type str). In Python 2, the str type's __repr__ method shows non-ASCII bytes as escape sequences such as \xc3\xad, whereas printing an individual str object writes the bytes themselves, which your terminal then displays as accented characters.
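To make that concrete without hitting the network, here is a minimal sketch. The sample title is just an illustrative stand-in for one scraped item, and the variable names are my own; the snippet avoids print so it runs the same way under Python 2 and 3 (under Python 3 the repr additionally carries a b prefix).

```python
# The title below is an illustrative sample, not fetched from the site.
title = u'Contrato Vital\xedcio'   # text containing an accented character
encoded = title.encode('utf-8')    # UTF-8 bytes, like the items in the question's list

# repr() shows the non-ASCII bytes as escape sequences.
element_repr = repr(encoded)

# str() of a list is built from the repr() of each element,
# so the same escapes appear between the brackets.
list_str = str([encoded])

# The bytes themselves are intact: decoding gives back the original text.
roundtrip = encoded.decode('utf-8')
```

So the data in the list is not corrupted; only the way the list displays its elements differs from the way a single string is printed.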

Here's a great question to help explain the difference: Difference between __str__ and __repr__ in Python

If you print each string individually, you should get the results you want.

import re
import requests

def opena(url):
    # Send a browser-like user agent with the request.
    headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
    lexdan1 = requests.get(url,headers=headers)
    # .text is the response body already decoded to unicode.
    lexdan2 = lexdan1.text
    lexdan1.close()
    return lexdan2

dan = []
a = opena('http://www.megafilmesonlinehd.com/filmes-lancamentos')
d = re.compile('<strong class="tt-filme">(.+?)</strong>').findall(a)
for name in d:
    dan.append(name)
# Print each title on its own, so print uses str rather than repr.
for item in dan:
    print item
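If you would rather have all the titles in one piece of output instead of one print per item, joining them into a single string should also work, since the result is a string rather than a list. This is only a sketch: format_titles and the sample data are hypothetical names of mine, and it assumes the titles are kept as unicode text (as returned by .text) rather than encoded bytes.

```python
def format_titles(titles, sep=u'\n'):
    """Join text titles into one string, so printing it never goes through repr()."""
    return sep.join(titles)

# Hypothetical sample data standing in for the scraped titles.
sample = [u'Contrato Vital\xedcio HD 720p', u'Estrelas Al\xe9m do Tempo Dublado']

# One string containing both titles, separated by a comma and a space.
output = format_titles(sample, sep=u', ')
```

Printing output on a terminal that can display accented characters should then show the titles with their accents, with no \x escapes.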
wpercy

When you print a list, whatever is inside it is represented (via each item's __repr__ method) rather than printed (via its __str__ method):

class test():
    def __repr__(self):
        print '__repr__'
        return ''
    def __str__(self):
        print '__str__'
        return ''

will get you:

>>> a = [test()]
>>> a
[__repr__
]
>>> print a
[__repr__
]
>>> print a[0]
__str__

And the __repr__ method of a string does not render special characters; it shows them as escape sequences (even \t and \n).
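A quick way to check this yourself, as a small sketch with made-up sample text (it behaves the same under Python 2 and 3):

```python
# Sample text containing a real tab and a real newline character.
text = 'col1\tcol2\nrow'

# repr() spells the control characters out as backslash escapes...
shown_by_repr = repr(text)

# ...and a list containing the string shows that same escaped form.
shown_in_list = str([text])

# The string itself still holds the actual tab and newline characters.
has_real_tab = '\t' in text
```

Here shown_by_repr contains a literal backslash followed by t, not a tab, which is the same mechanism that turns the accented bytes into \xc3\xad when the list is printed.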

berna1111