I have a script I've been developing on my Mac that uses Scrapy, a Python library for web scraping. I thought everything was fine until I tried to load it onto the server this morning.

The server runs Debian 8.2, and the scraping itself works. The problem comes when reading the scraped file. Debian seems to read the file as a great number of at signs (@n^@^@^@d^@^@^@e^@^@^@x^@^@^@.^@^@^@p), but uploading the file to Dropbox and looking at it there reveals that the file is in fact full of URLs. So the scraping is fine, but the file cannot be read properly.

How can I resolve this?

Larger slice: i^@^@^@n^@^@^@d^@^@^@e^@^@^@x^@^@^@.^@^@^@p^@^@^@h^@^@^@p^@^@^@i^@^@^@n^@^@^@d^@^@^@e^@^@^@x^@^@^@.^@^@^@p^@^@^@h^@^@^@p^@^@^@?^@^@^@s^@^@^@t^@^@^@r^@^@^@P^@^@^@a^@^@^@g^@^@^@e^@^@^@I^@^@^@D^@^@^@=^@^@^@S^@^@^@F^@^@^@0^@^@^@1^@^@^@_^@^@^@0^@^@^@3^@^@^@_^@^@^@0^@^@^@1^@^@^@.^@^@^@.^@^@^@/^@^@^@k^@^@^@o^@^@^@/^@^@^@.^@^@^@.^@^@^@/^@^@^@e^@^@^@n^@^@^@/^@^@^@.^@^@^@.^@^@^@/^@^@^@c^@^@^@n^@^@^@/^@^@^@i^@^@^@n^@^@^@d^@^@^@e^@^@^@x^@^@^@.^@^@^@p^@^@^@h^@^@^@p^@^@^@?

iTry

1 Answer

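This looks like a problem with UCS-2 (which is, basically, UTF-16). Use encoding='utf16' or encoding='utf_16_be' in your Python program (see details here). Note that the sample shows three NUL bytes (^@) after each character, which points to a 4-byte-per-character encoding such as little-endian UTF-32, so if the UTF-16 codecs do not decode the file cleanly, try encoding='utf_32' or encoding='utf_32_le'.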
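
As a rough sketch of decoding with an explicit encoding: the helper names `guess_wide_encoding` and `read_text` below are hypothetical, and the NUL-pattern heuristic is an assumption that only holds for mostly-ASCII text such as URLs. It is meant to illustrate the idea, not to replace a proper encoding detector.

```python
import codecs


def guess_wide_encoding(raw: bytes) -> str:
    """Hypothetical heuristic: guess the codec of mostly-ASCII text from its first bytes."""
    # A byte-order mark settles the question; check UTF-32 first because
    # the UTF-32-LE BOM starts with the same two bytes as the UTF-16-LE BOM.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf_32"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf_16"
    # No BOM: an ASCII character followed by three NULs looks like UTF-32-LE,
    # an ASCII character followed by a single NUL looks like UTF-16-LE.
    if len(raw) >= 4 and raw[1:4] == b"\x00\x00\x00":
        return "utf_32_le"
    if len(raw) >= 2 and raw[1:2] == b"\x00":
        return "utf_16_le"
    return "utf_8"


def read_text(path: str) -> str:
    """Read a file as bytes and decode it with the guessed codec."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(guess_wide_encoding(raw))
```

If the encoding is already known, passing it directly, for example `open(path, encoding="utf_32_le")`, is simpler than guessing.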

You can convert your files from UCS-2 to UTF-8 using the iconv utility like this:

iconv -f UCS-2 -t UTF-8 inputfile > outputfile
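
The same conversion can be sketched in Python if iconv is not convenient. The function name `transcode_to_utf8` is hypothetical, and the default source encoding of `utf_32_le` is an assumption based on the three NUL bytes per character in the question's sample; pass `utf_16` or another codec if that is what the file actually uses.

```python
def transcode_to_utf8(src_path, dst_path, src_encoding="utf_32_le"):
    """Rewrite src_path as UTF-8 at dst_path (hypothetical helper)."""
    # newline="" disables newline translation so line endings are copied as-is.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf_8", newline="") as dst:
        # Copy in chunks so large scrape outputs need not fit in memory.
        for chunk in iter(lambda: src.read(64 * 1024), ""):
            dst.write(chunk)
```

For example, `transcode_to_utf8("inputfile", "outputfile")` would mirror the iconv call above, with the source encoding swapped for the assumed one.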
vrs