urlopen choking me with newlines

Question

I'm scraping simple textfiles from a url.

import urllib2

def scrape_contents_ex(url):
    data = urllib2.urlopen(url)
    # read() returns the raw response body as a byte string
    return data.read()

The problem is that the string it returns is full of newline and tab escape sequences such as "\n", "\t" and "\r".

Example:

Here is the webpage

When I print the string output in Python, it renders with various backslash escapes:

I don't know how to properly handle the output I read from urlopen. I want to store these contents in PostgreSQL. There is another complication: the content will very likely contain Unicode text (Chinese characters, Cyrillic, etc.).

What is the proper and robust way to read and store this?

score 0 · Answer 1 · answered May 01 '16 at 02:40

You can use the str.split() method, though there are a lot of options to solve this particular problem.

From the Python 3.5.1 docs:

>>> '1,2,3'.split(',')
['1', '2', '3']
>>> '1,2,3'.split(',', maxsplit=1)
['1', '2,3']
>>> '1,2,,3,'.split(',')
['1', '2', '', '3', '']

You would want something like

return data.read().split('\n\t')

The result is a list of the substrings found between occurrences of the exact two-character sequence '\n\t' in your original string.
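As a rough sketch of how the options differ (the sample string below is made up, not taken from the page in the question): split('\n\t') only breaks on that exact sequence, while splitlines() breaks on any line ending, and a second split('\t') then separates the tab-delimited fields.

```python
# Hypothetical sample; data read from urlopen may look similar.
sample = "name\tvalue\r\nfoo\t1\r\nbar\t2\r\n"

# No "\n\t" sequence occurs here, so the string comes back as a single item.
exact = sample.split("\n\t")

# splitlines() handles "\n", "\r\n" and "\r" and drops the line endings.
lines = sample.splitlines()

# Split each line on tabs to get the individual fields.
rows = [line.split("\t") for line in lines]

print(exact)
print(lines)
print(rows)
```

If the file is simple tab-separated text, the rows list is probably closer to what you want to store than the output of a single split call.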

ajthyng


Totally unrelated to what you asked, but I found the requests library to be much better than urllib2. – ajthyng May 01 '16 at 02:40

score 0 · Answer 2 · answered May 01 '16 at 02:44

You need to use the 'urllib' and 'urllib2' libraries to avoid encoding problems.

You can check the following link: https://docs.python.org/2/howto/urllib2.html

Amitabh Tiwari


score 0 · Answer 3 · edited May 23 '17 at 12:23

foo is a bytestring in your case. If it represents text, you should decode it into Unicode before storing it in PostgreSQL: text = foo.decode(character_encoding). The character encoding may depend on the Content-Type header. See "A good way to get the charset/encoding of an HTTP response in Python".
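A minimal sketch of that decoding step, assuming you already have the raw bytes and the Content-Type header value as a string. The helper name decode_body and the UTF-8 fallback are my own choices for illustration, not something prescribed by urllib2:

```python
from email.message import Message

def decode_body(raw_bytes, content_type, fallback="utf-8"):
    """Decode a response body using the charset named in Content-Type.

    decode_body and the utf-8 fallback are illustrative assumptions.
    """
    # Let the stdlib header parser pull out the charset parameter, if any.
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or fallback
    # Raises LookupError for an unknown charset and UnicodeDecodeError
    # if the bytes do not match the declared encoding.
    return raw_bytes.decode(charset)

# Made-up example: Latin-1 bytes with a declared charset.
text = decode_body(b"caf\xe9\n", "text/plain; charset=ISO-8859-1")
print(repr(text))
```

With urllib2, the header value would come from something like data.info().getheader('Content-Type'). The decoded result is a Unicode string, which is the form you would pass on to the database driver.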

When you type foo at the prompt, IPython tries to display the foo object, and it may call repr(foo) to do so.

What you see, "a\nb" (the result of the repr() call), is a printable representation of the Python object with the type str (type(foo) == str). Python string literals use the same syntax. The backslash is special inside string literals, e.g., "\n" is a single character (a newline; ord("\n") == 10). If you want to create a string that contains two characters, backslash + n, then you have to escape the backslash or use raw string literals:

>>> "\\n" == r"\n" != "\n"
True
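To see the difference between the repr() display and the actual contents, here is a small sketch (the variable names are only for illustration):

```python
text = "a\nb"        # three characters: 'a', a newline, 'b'
shown = repr(text)   # the form the interactive prompt displays
raw = r"\n"          # two characters: a backslash and 'n'

print(len(text))     # counts the newline as one character
print(shown)         # shows the escape sequence, with quotes
print(text)          # prints 'a' and 'b' on separate lines
print(len(raw))
```

The backslashes only appear in the displayed representation. The stored string contains real newline and tab characters, so nothing needs to be stripped just because the prompt shows "\n".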
