Python Requests Reading text

Question

I am trying to learn text processing with NLTK by following the NLTK book. When I read a text, the result looks slightly different from what I expect.

import requests
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = requests.get(url)
response.text[:25]

How can I read the text without the highlighted part in the uploaded image (the leading '\ufeff')?

Try slicing differently? `[1:25]`? – OneCricketeer Jun 09 '18 at 15:09

score 0 · Answer 1 · answered Jun 09 '18 at 15:23

The simple answer is to print the value instead of just evaluating it in the shell:

print(response.text[:25])

This should print:

The Project Gutenberg EB

The shell calls repr on the value to decide what to display, so

print(repr(response.text[:25]))

will again print:

'\ufeffThe Project Gutenberg EB'
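The difference between the two displays can be reproduced without a network request. The following is a minimal sketch that uses a hard-coded sample string (an assumption, standing in for `response.text[:25]`). It also shows that printing only hides the character and does not remove it:

```python
# Sample string standing in for response.text[:25] (an assumption);
# the leading '\ufeff' is the character the shell displays.
sample = '\ufeffThe Project Gutenberg EB'

# The interactive shell echoes repr(value), which escapes '\ufeff'.
shown_by_shell = repr(sample)
print(shown_by_shell)

# print() writes the character itself, which most terminals render as nothing.
print(sample)

# The character is still part of the string after printing.
print(len(sample))                  # 25, including the leading '\ufeff'
print(sample.startswith('\ufeff'))  # True
```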

score 0 · Accepted Answer · answered Jun 09 '18 at 15:29

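What you are seeing is a Unicode character: the byte order mark (BOM), U+FEFF, at the start of the text.

One way to get rid of it is to encode the Unicode string to ASCII with the 'ignore' error handler, which drops every character that is not ASCII.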

Example:

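a = u'\ufeffHello World'
# Encode to ASCII, dropping characters that cannot be represented (here, the BOM)
print(a.encode('ascii', 'ignore'))
# On Python 3 this prints b'Hello World' (a bytes object)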
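Note that `encode('ascii', 'ignore')` drops every non-ASCII character, not only the leading mark. If the goal is to remove just the mark, one alternative (not part of the answer above) is Python's `utf-8-sig` codec, which strips a leading BOM while decoding. A minimal sketch, assuming the download is UTF-8 and using a hard-coded sample in place of `response.content`:

```python
# Hypothetical sample standing in for response.content (the raw bytes);
# the data is assumed to be UTF-8 with a leading BOM.
sample_bytes = '\ufeffThe Project Gutenberg EBook (café)'.encode('utf-8')

# Option 1: the 'utf-8-sig' codec removes a leading BOM while decoding.
text_sig = sample_bytes.decode('utf-8-sig')

# Option 2: decode normally, then strip the BOM character from the front.
text_stripped = sample_bytes.decode('utf-8').lstrip('\ufeff')

# For comparison: ASCII with 'ignore' removes the BOM but also loses 'é'.
text_ascii = sample_bytes.decode('utf-8').encode('ascii', 'ignore').decode('ascii')

print(repr(text_sig))       # 'The Project Gutenberg EBook (café)'
print(repr(text_stripped))  # 'The Project Gutenberg EBook (café)'
print(repr(text_ascii))     # 'The Project Gutenberg EBook (caf)'
```

With requests, the same idea would be `response.content.decode('utf-8-sig')`, again assuming the server really sends UTF-8.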

yadavankit

`import requests`, `url = "http://www.gutenberg.org/files/2554/2554-0.txt"`, `response = requests.get(url)`, `raw = response.text.encode('ascii', 'ignore')`, `print(raw[:25])` prints `b'The Project Gutenberg EBo'`. What is that `b` at the beginning? – Tronald Dump Jun 09 '18 at 15:57
That `b` stands for `bytes`. Hope this helps: https://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal – yadavankit Jun 09 '18 at 19:02
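To illustrate the `b` prefix: `encode` returns a `bytes` object, and calling `decode` on it gives back a `str`. A short sketch with a hard-coded sample string (an assumption, in place of the downloaded text):

```python
# Sample string standing in for response.text (an assumption).
text = '\ufeffThe Project Gutenberg EBook'

raw = text.encode('ascii', 'ignore')  # bytes object; the BOM is dropped
print(type(raw).__name__)             # bytes
print(raw[:25])                       # b'The Project Gutenberg EBo'

as_str = raw.decode('ascii')          # decode to get a str again
print(type(as_str).__name__)          # str
print(as_str[:25])                    # The Project Gutenberg EBo
```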
