I am trying to learn text processing with NLTK by following the NLTK book. When I read a text from a URL, the beginning of the result looks slightly different from what I expected.

import requests
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = requests.get(url)
response.text[:25]

How can I read the text without the leading '\ufeff' that appears at the start of the output?

Tronald Dump

2 Answers

The simple answer is to print the value instead of just evaluating it in the shell:

print(response.text[:25])

This should print:

The Project Gutenberg EB

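The shell calls repr on the value to decide what to display, and repr escapes non-printable characters such as '\ufeff' (the byte order mark). print writes the string itself, so the character is still there but is not visible.

print(repr(response.text[:25]))

prints the same thing the shell showed:

'\ufeffThe Project Gutenberg EB'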
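
To see the difference without downloading anything, here is a minimal sketch that uses a hard-coded string in place of response.text[:25]. The variable names are only for illustration.

```python
# Stand-in for response.text[:25]; the first character is U+FEFF (byte order mark).
sample = '\ufeffThe Project Gutenberg EB'

shown_by_shell = repr(sample)  # what the interactive shell echoes
shown_by_print = str(sample)   # what print() writes; the BOM is present but invisible

print(shown_by_shell)
print(len(sample))  # the BOM still counts as one character
```

Note that print only hides the character. It is still the first character of the string.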
MegaIng

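What you're seeing is a Unicode character: '\ufeff' is the byte order mark (BOM) at the start of the file.

One way to get rid of it is to encode the Unicode string to ASCII with the 'ignore' error handler, which drops every character that is not ASCII.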

Example:

a = u'\ufeffHello World'
print(a.encode('ascii', 'ignore'))  # Python 3 prints: b'Hello World'
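
Encoding to ASCII with 'ignore' also discards every other non-ASCII character in the text, such as accented letters. If the aim is only to drop the leading BOM, a narrower sketch is to decode the raw bytes with Python's 'utf-8-sig' codec, or to strip the character from the decoded string. The byte string below is a hypothetical stand-in for response.content, and it assumes the file is UTF-8.

```python
# Hypothetical stand-in for response.content: UTF-8 bytes that start with a BOM (EF BB BF).
raw_bytes = b'\xef\xbb\xbfThe Project Gutenberg EBook'

# Option 1: the 'utf-8-sig' codec skips a leading BOM while decoding.
text = raw_bytes.decode('utf-8-sig')

# Option 2: decode normally, then strip the BOM character from the front.
text_alt = raw_bytes.decode('utf-8').lstrip('\ufeff')

print(repr(text[:25]))
```

With requests, setting response.encoding = 'utf-8-sig' before reading response.text should have the same effect, again assuming the file is UTF-8.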
yadavankit
  • I tried `import requests; url = "http://www.gutenberg.org/files/2554/2554-0.txt"; response = requests.get(url); raw = response.text.encode('ascii', 'ignore'); print(raw[:25])` and it prints `b'The Project Gutenberg EBo'`. What is that `b` at the beginning? – Tronald Dump Jun 09 '18 at 15:57
  • That prefix marks a `bytes` object. This may help: https://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal – yadavankit Jun 09 '18 at 19:02
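
To illustrate the `b` prefix from the comments: str.encode returns a bytes object, and bytes.decode turns it back into a str. A minimal sketch with a hard-coded string standing in for response.text:

```python
# Hard-coded stand-in for response.text.
text = '\ufeffThe Project Gutenberg EBook'

raw = text.encode('ascii', 'ignore')  # bytes: non-ASCII characters are dropped
clean = raw.decode('ascii')           # back to str, so no b prefix when shown

print(type(raw).__name__, repr(raw[:25]))
print(type(clean).__name__, repr(clean[:25]))
```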