
I'm scraping simple text files from a URL.

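import urllib2  # Python 2; in Python 3 use urllib.request instead

def scrape_contents_ex(url):
    # urlopen returns a file-like response; read() gives the raw bytes
    data = urllib2.urlopen(url)
    return data.read()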

The problem is that the string it yields is full of escape sequences for newline and tab characters, such as "\t" and "\r".

Example:

Here is the webpage: [image]

When I print the string output in Python, it renders with various backslash escapes:

[image]

I don't know how to properly handle the output I read from urlopen. I want to store these contents in PostgreSQL. Moreover, I have another complication: the content will very likely contain Unicode text (Chinese characters, Cyrillic, etc.).

What is the proper and robust way to read and store this?

user3556757

3 Answers

0

You can use the str.split() method, though there are a lot of options to solve this particular problem.

From the python 3.5.1 docs:

>>> '1,2,3'.split(',')
['1', '2', '3']
>>> '1,2,3'.split(',', maxsplit=1)
['1', '2,3']
>>> '1,2,,3,'.split(',')
['1', '2', '', '3', '']

You would want something like

return data.read().split('\n\t')

The result is a list of the substrings found between occurrences of the exact two-character sequence '\n\t' in your original string.
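As a rough illustration of the splitting idea, here is a minimal sketch that turns tab-separated text into rows. The sample string is invented for the example (the real file layout is not known from the question), and it uses str.splitlines() rather than a fixed '\n\t' separator so that "\n" and "\r\n" line endings are both handled.

```python
# Hypothetical sample: tab-separated fields with Windows-style line endings.
sample = "name\tcity\r\nAnna\tOslo\r\n"

# splitlines() breaks on "\n", "\r\n" and "\r" and drops the line endings;
# split("\t") then separates the fields on each line.
rows = [line.split("\t") for line in sample.splitlines()]

print(rows)
```

Whether splitting is wanted at all depends on how the text should be stored; if the file is to be kept verbatim, the tabs and newlines can simply stay in the string.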

ajthyng
  • Totally unrelated to what you asked, but I found the requests library to be much better than urllib2. – ajthyng May 01 '16 at 02:40
0

You need to use the 'urllib' and 'urllib2' libraries to avoid encoding problems.

You can check the following link: https://docs.python.org/2/howto/urllib2.html

0

foo is a bytestring in your case. If it represents text, you should decode it into Unicode before storing it in PostgreSQL: text = foo.decode(character_encoding). The charset may depend on the Content-Type header. See "A good way to get the charset/encoding of an HTTP response in Python".
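The decoding step might look like the sketch below. It is written for Python 3 and works on bytes already in hand, so it makes no network call. The helper name decode_body and the UTF-8 fallback are assumptions made for illustration, not something the question or the linked answer prescribes; the Content-Type values are made up as well.

```python
from email.message import Message


def decode_body(raw, content_type, default="utf-8"):
    """Decode raw response bytes using the charset named in Content-Type."""
    charset = None
    if content_type:
        # Let the standard library parse the header parameters.
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
    # Assumption: fall back to UTF-8 when the header names no charset.
    return raw.decode(charset or default)


raw = "Привет, 世界".encode("utf-8")
text = decode_body(raw, "text/plain; charset=UTF-8")
print(text)
```

A charset name that Python does not recognise raises LookupError, and bytes that do not match the declared charset raise UnicodeDecodeError, so real code would probably want to handle both.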

When you type foo at the prompt, IPython tries to display the foo object, and it may call repr(foo).

What you see, "a\nb" (the result of the repr() call), is a printable representation of the Python object with the type str (type(foo) == str). Python string literals use the same syntax. The backslash is special inside string literals, e.g., "\n" is a single character (a newline, ord("\n") == 10). If you want to create a string that contains two characters, a backslash followed by n, then you have to escape the backslash or use a raw string literal:

>>> "\\n" == r"\n" != "\n"
True
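The difference between the stored characters and their repr() can be checked directly. This is a small sketch with a made-up string, run under Python 3; the point is only that the backslashes appear in the representation, not in the data.

```python
text = "a\tb\r\nc"   # six characters: a, TAB, b, CR, LF, c

shown = repr(text)   # what an interactive prompt displays
length = len(text)   # escape sequences count as one character each

print(shown)
print(length)
print(text)          # print() writes the actual tab and line break
```

So the string itself needs no cleaning before storage unless the tabs and line endings are genuinely unwanted.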
jfs