Linux converting my URLs to at signs?

Question

I have a script I've been developing on my Mac that uses Scrapy, a Python library for web scraping. I thought everything was fine until I tried to load it onto the server this morning.

The server runs Debian 8.2 and it scrapes fine. The problem comes with reading its scraped file. Debian seems to read the file as a great number of at signs (@n^@^@^@d^@^@^@e^@^@^@x^@^@^@.^@^@^@p), but uploading the file to Dropbox and looking at it reveals that the file is in fact full of URLs. So the scraping is fine, but the file cannot be read properly.

How can I resolve this?

Larger slice: i^@^@^@n^@^@^@d^@^@^@e^@^@^@x^@^@^@.^@^@^@p^@^@^@h^@^@^@p^@^@^@i^@^@^@n^@^@^@d^@^@^@e^@^@^@x^@^@^@.^@^@^@p^@^@^@h^@^@^@p^@^@^@?^@^@^@s^@^@^@t^@^@^@r^@^@^@P^@^@^@a^@^@^@g^@^@^@e^@^@^@I^@^@^@D^@^@^@=^@^@^@S^@^@^@F^@^@^@0^@^@^@1^@^@^@_^@^@^@0^@^@^@3^@^@^@_^@^@^@0^@^@^@1^@^@^@.^@^@^@.^@^@^@/^@^@^@k^@^@^@o^@^@^@/^@^@^@.^@^@^@.^@^@^@/^@^@^@e^@^@^@n^@^@^@/^@^@^@.^@^@^@.^@^@^@/^@^@^@c^@^@^@n^@^@^@/^@^@^@i^@^@^@n^@^@^@d^@^@^@e^@^@^@x^@^@^@.^@^@^@p^@^@^@h^@^@^@p^@^@^@?

Could this be an artifact due to the setting of the LANG system variable? — mdpc, Dec 24 '15 at 17:04
I used nano to just take a look at it, but technically the script uses sed. — iTry, Dec 24 '15 at 17:04
Can you post a hex-dump of the first 24 or so characters? I have a hunch that it's being stored in full UCS-4 (https://en.wikipedia.org/wiki/UTF-32). — Joel C, Dec 24 '15 at 17:04
I've added a little more, if it were UCS-4, what would be the way of resolving this? — iTry, Dec 24 '15 at 17:10
Have not tested, but try `iconv -f UCS-4 -t UTF-8 infile > outfile` — Joel C, Dec 24 '15 at 17:14
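
To check the hunch above without a separate hex-dump tool, a short Python sketch can print the first bytes of the file and guess the character width from how many NUL bytes it contains. The names `hex_preview` and `guess_width` are placeholders, not part of the original script, and the guess is only a heuristic that assumes mostly-ASCII text such as these URLs.

```python
def hex_preview(data: bytes, count: int = 24) -> str:
    """Return the first `count` bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data[:count])


def guess_width(data: bytes) -> int:
    """Guess bytes per character for mostly-ASCII text (heuristic only).

    ASCII stored as UTF-32 is about 75% NUL bytes, as UTF-16 about 50%,
    and as UTF-8 it contains none.
    """
    sample = data[:4096]
    if not sample:
        return 1
    nul_ratio = sample.count(0) / len(sample)
    if nul_ratio > 0.6:
        return 4  # looks like UTF-32 / UCS-4
    if nul_ratio > 0.3:
        return 2  # looks like UTF-16 / UCS-2
    return 1      # no NUL padding seen


# Demo on in-memory bytes; for a real file, read it in binary mode first,
# e.g. data = open(path, "rb").read() with your own path.
sample = "index.php".encode("utf_32_le")
print(hex_preview(sample, 8))  # 69 00 00 00 6e 00 00 00
print(guess_width(sample))     # 4
```

Three `00` bytes after each character, as in the slice posted in the question, would make this report a width of 4.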

score 2 · Answer 1 · answered Dec 24 '15 at 17:29

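This is an encoding problem: the file is stored in a wide Unicode encoding, and each ^@ is a NUL byte. UCS-2 (which is, basically, UTF-16) would put one NUL after each ASCII character; the sample in the question shows three, which points to UTF-32 (UCS-4), little-endian. Use the matching encoding in your Python program, for example encoding='utf_32_le', or encoding='utf_16' if the file turns out to be UTF-16 (see details here).

You can convert the file to UTF-8 with the iconv utility this way:

iconv -f UTF-32LE -t UTF-8 inputfile > outputfile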
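
A minimal sketch of the same conversion in Python, for cases where iconv is not handy. It assumes the file is UTF-32 little-endian, which is what the three NUL bytes per character in the question's sample suggest; pass a different `src_encoding` if that turns out to be wrong. The function name `reencode` and any file names you pass to it are placeholders.

```python
from pathlib import Path


def reencode(src: str, dst: str, src_encoding: str = "utf_32_le") -> int:
    """Decode `src` using `src_encoding` and write it to `dst` as UTF-8.

    Returns the number of characters written.
    """
    text = Path(src).read_bytes().decode(src_encoding)
    # The explicit -le/-be codecs keep a byte-order mark as a character,
    # so drop a single leading one if present.
    if text.startswith("\ufeff"):
        text = text[1:]
    # Write bytes directly so line endings are not translated.
    Path(dst).write_bytes(text.encode("utf-8"))
    return len(text)
```

If only the Python script needs to read the file, opening it with `open(path, encoding="utf_32_le")` avoids the intermediate copy; the converted file is mainly useful for tools such as sed or nano that expect UTF-8.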
