6

I have a url such as

http://example.com/here/there/index.html

now I want to save a file and its content in a directory. I want the name of the file to be:

http://example.com/here/there/index.html

but I get an error; I'm guessing it is the result of the `/` characters in the URL.

This is what I'm doing at the moment.

        with open('~/' + response.url, 'w') as f:
            f.write(response.body)

any ideas how I should do it instead?

E_net4
nafas
  • `I want the name of the file to be :...` why? – njzk2 Dec 02 '14 at 16:10
  • That's what I thought, and then my answer was downvoted :D I think a lot of problems exist only because there is a suitable detour in one's head. – wenzul Dec 02 '14 at 16:17
  • @njzk2 well the reason is I'm gonna download several pages into a folder; it would be much easier to refer to a url if you have it as the file name. This way I don't have to keep some crazy hashMap (or something else) for each file – nafas Dec 02 '14 at 16:28
  • so what you actually want is a filename that is uniquely related to the url without any extra data. the answer from @ReutSharabani is a good solution – njzk2 Dec 02 '14 at 16:33
  • @njzk2 yeah pretty much, as index.html won't be unique. answer from Reut Sharabani was great but unfortunately encoder results can sometime contain **/** which produces same problem – nafas Dec 02 '14 at 16:37
  • If you need just the one way from url to filename you could also use the hash as filename. – wenzul Dec 02 '14 at 16:41

5 Answers

31

You could use base64 encoding, which is reversible (in Python 3, `b64encode` works on bytes):

>>> import base64
>>> base64.b64encode(b'http://example.com/here/there/index.html')
b'aHR0cDovL2V4YW1wbGUuY29tL2hlcmUvdGhlcmUvaW5kZXguaHRtbA=='
>>> base64.b64decode(b'aHR0cDovL2V4YW1wbGUuY29tL2hlcmUvdGhlcmUvaW5kZXguaHRtbA==')
b'http://example.com/here/there/index.html'

or perhaps binascii:

>>> import binascii
>>> binascii.hexlify(b'http://example.com/here/there/index.html')
b'687474703a2f2f6578616d706c652e636f6d2f686572652f74686572652f696e6465782e68746d6c'
>>> binascii.unhexlify('687474703a2f2f6578616d706c652e636f6d2f686572652f74686572652f696e6465782e68746d6c')
b'http://example.com/here/there/index.html'
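As noted in the comments below, standard base64 output can itself contain `/`, which defeats the purpose on Linux. A sketch using Python 3's `base64.urlsafe_b64encode`, which substitutes `-` and `_` for `+` and `/` so the encoded name never contains a path separator (helper names here are illustrative, not from the original answer):

```python
import base64

def url_to_filename(url: str) -> str:
    # urlsafe_b64encode swaps '+' and '/' for '-' and '_',
    # so the result contains no path separators
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")

def filename_to_url(name: str) -> str:
    # exact inverse: decode the urlsafe alphabet back to the original url
    return base64.urlsafe_b64decode(name.encode("ascii")).decode("utf-8")

name = url_to_filename("http://example.com/here/there/index.html")
print("/" in name)              # False: safe to use as a filename
print(filename_to_url(name))    # round-trips to the original url
```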
Reut Sharabani
  • I wish I could double vote you mate, +10 for answering what I was looking for. It worked like a charm – nafas Dec 02 '14 at 16:12
  • I'm assuming this solution would be used for more than 1 URL. However, seeing as `base64` encoding/decoding is not unique (https://stackoverflow.com/questions/30429168/is-a-base64-encoded-string-unique), then you could end up having different URL overwriting each other! – pir Jun 25 '17 at 02:15
  • Also, couldn't you easily have issues with file length if encoding longer URLs (https://serverfault.com/questions/9546/filename-length-limits-on-linux)? – pir Jun 25 '17 at 02:18
  • I would really fix these issues or refer to the accepted answer as being the better solution. – pir Jun 25 '17 at 02:23
  • You **could** have many problems with the accepted answer as well (and very much the same). Can you give an example of an actual problem here that isn't a problem in the accepted answer? – Reut Sharabani Jun 25 '17 at 04:43
  • I used this code for a large number of URLs and found that some of them got encoded to strings that were too long. I don't have the same issue with the accepted answer. I don't think the accepted answer has an issue with the encoding/decoding not being unique, which this one has (see the link I posted in another comment). – pir Jun 26 '17 at 02:37
  • can you give an example of it not being unique? remember - you can have multiple results per filename if you can decode it. – Reut Sharabani Jun 26 '17 at 04:29
  • About length: the accepted answer has the same flaw. The generated filename will be longer than the url as well. – Reut Sharabani Jun 26 '17 at 04:30
  • yay to `binascii`, `base64.b64encode` uses `/` so it's a no-go on linux for filenames – CpILL Jun 15 '20 at 09:41
9

You have several problems. One of them is that Unix shell abbreviations (~) are not going to be auto-interpreted by Python as they are in Unix shells.

The second is that you're not going to have good luck writing a file path in Unix that has embedded slashes. You will need to convert them to something else if you're going to have any chance of retrieving the files later. You could do that with something as simple as response.url.replace('/','_'), but that would leave many other characters that are also potentially problematic. You may wish to "sanitize" all of them in one shot. For example:

import os
import urllib

def write_response(response, filedir='~'):
    filedir = os.path.expanduser(filedir)
    filename = urllib.quote(response.url, '')
    filepath = os.path.join(filedir, filename)
    with open(filepath, "w") as f:
        f.write(response.body)

This uses os.path functions to clean up the file paths, and urllib.quote to sanitize the URL into something that could work for a file name. There is a corresponding unquote to reverse that process.

Finally, when you write to a file, you may need to tweak that a bit depending on what the responses are, and how you want them written. If you want them written in binary, you'll need "wb" not just "w" as the file mode. Or if it's text, it might need some sort of encoding first (e.g., to utf-8). It depends on what your responses are, and how they are encoded.

Edit: In Python 3, urllib.quote is now urllib.parse.quote.
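A minimal Python 3 round-trip of the same idea (assuming the goal is just url-to-filename and back; variable names here are illustrative):

```python
from urllib.parse import quote, unquote

url = "http://example.com/here/there/index.html"

# safe='' percent-encodes every reserved character, including '/'
name = quote(url, safe="")
print(name)           # http%3A%2F%2Fexample.com%2Fhere%2Fthere%2Findex.html

# unquote is the exact inverse, recovering the original url
print(unquote(name))  # http://example.com/here/there/index.html
```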

Mark Peschel
Jonathan Eunice
  • thanks a lot mate, even though @Reut Sharabani's answer was great, this is better and more robust – nafas Dec 02 '14 at 16:20
4

This is a bad idea, as you will hit the 255-byte filename limit common on most filesystems; URLs tend to be very long, and even longer when b64-encoded!

You can compress and b64 encode but it won't get you very far:

from base64 import b64encode 
import zlib
import bz2
from urllib.parse import quote

def url_strategies(url):
    url = url.encode('utf8')
    print(url.decode())
    print(f'normal  : {len(url)}')
    print(f'quoted  : {len(quote(url, ""))}')
    b64url = b64encode(url)
    print(f'b64     : {len(b64url)}')
    url = b64encode(zlib.compress(b64url))
    print(f'b64+zlib: {len(url)}')
    url = b64encode(bz2.compress(b64url))
    print(f'b64+bz2 : {len(url)}')

Here's an average url I've found on angel.co:


URL = 'https://angel.co/job_listings/browse_startups_table?startup_ids%5B%5D=972887&startup_ids%5B%5D=365478&startup_ids%5B%5D=185570&startup_ids%5B%5D=32624&startup_ids%5B%5D=134966&startup_ids%5B%5D=722477&startup_ids%5B%5D=914250&startup_ids%5B%5D=901853&startup_ids%5B%5D=637842&startup_ids%5B%5D=305240&tab=find&page=1'

And even with b64+zlib it doesn't fit into 255 limit:

normal  : 316
quoted  : 414
b64     : 424
b64+zlib: 304
b64+bz2 : 396

Even with the best strategy of zlib compression and b64encode you'd still be in trouble.

Proper Solution

Alternatively, what you should do is hash the url and attach the url to the file as an extended file attribute:

import os
from hashlib import sha256

def save_file(url, content, char_limit=13):
    # hash url as sha256 13 character long filename
    hash = sha256(url.encode()).hexdigest()[:char_limit]
    filename = f'{hash}.html'
    # 93fb17b5fb81b.html
    with open(filename, 'w') as f:
        f.write(content)
    # set url attribute
    os.setxattr(filename, 'user.url', url.encode())

and then you can retrieve the url attribute:

print(os.getxattr(filename, 'user.url').decode())
'https://angel.co/job_listings/browse_startups_table?startup_ids%5B%5D=972887&startup_ids%5B%5D=365478&startup_ids%5B%5D=185570&startup_ids%5B%5D=32624&startup_ids%5B%5D=134966&startup_ids%5B%5D=722477&startup_ids%5B%5D=914250&startup_ids%5B%5D=901853&startup_ids%5B%5D=637842&startup_ids%5B%5D=305240&tab=find&page=1'

Note: `os.setxattr` and `os.getxattr` require the `user.` prefix for user-defined attributes in Python.
For file attributes in Python, see the related answer here: https://stackoverflow.com/a/56399698/3737009
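Extended attributes are Linux-specific (`os.setxattr` is not available on all platforms). A portable variant of the same hash-the-url idea, sketched here with a sidecar JSON index instead of xattrs (the `save_file` signature and index filename are assumptions, not from the answer above):

```python
import json
import os
import tempfile
from hashlib import sha256

def save_file(dirpath, url, content, char_limit=13):
    # first 13 hex chars of the url's sha256 as a short, filesystem-safe name
    name = sha256(url.encode()).hexdigest()[:char_limit] + ".html"
    with open(os.path.join(dirpath, name), "w") as f:
        f.write(content)
    # record the filename -> url mapping in a sidecar index file
    index_path = os.path.join(dirpath, "index.json")
    index = {}
    if os.path.exists(index_path):
        with open(index_path) as f:
            index = json.load(f)
    index[name] = url
    with open(index_path, "w") as f:
        json.dump(index, f)
    return name

d = tempfile.mkdtemp()
n = save_file(d, "http://example.com/here/there/index.html", "<html></html>")
with open(os.path.join(d, "index.json")) as f:
    print(json.load(f)[n])  # recovers the original url from the index
```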

Granitosaurus
0

Using urllib's retrieval helper (this is Python 2; in Python 3 the equivalent is urllib.request.urlretrieve):

    import urllib

    testfile = urllib.URLopener()
    testfile.retrieve("http://example.com/here/there/index.html", "/tmp/index.txt")
Mercer
  • what I'd like is to be able to refer back to the file I've created; is it possible to change **/** to something like **\/** instead? – nafas Dec 02 '14 at 16:05
-1

You may want to look into restricted characters.

I would use a typical folder structure for this task. If you use it with a lot of URLs, it will become a mess one way or another, and you will run into filesystem performance issues or limits as well.

wenzul