Naming multiple files in python and scrapy

Question

I'm trying to save files to a directory after scraping them from the web using scrapy. I'm extracting a date from the file and using that as the file name. The problem I'm running into, however, is that some files have the same date, i.e. there are two files that would take the name "June 2, 2009". So, what I'm looking to do is somehow check whether there is already a file with the same name, and if so, name it something like "June 2, 2009.1" or some such.

The code I'm using is as follows:

def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url) 

    response = response.replace(body=response.body.replace('<br />', '\n'))

    hxs = HtmlXPathSelector(response)

    date = hxs.select("//div[@id='content']").extract()[0]
    dateStrip = re.search(r"([A-Z]*|[A-z][a-z]+)\s\d*\d,\s[0-9]+", date) 
    newDate = dateStrip.group()


    content = hxs.select("//div[@id='content']") 
    content = content.select('string()').extract()[0]

    filename = ("/path/to/a/folder/ %s.txt") % (newDate) 


    with codecs.open(filename, 'w', encoding='utf-8') as output:
        output.write(content)

score 1 · Answer 1 · answered Apr 17 '12 at 11:01

You can use os.listdir to get a list of the existing files and allocate a filename that will not cause a conflict.

import os
def get_file_store_name(path, fname):
    count = 0
    for f in os.listdir(path):
        if fname in f:
            count += 1
    return os.path.join(path, fname+str(count))

# Example usage
print(get_file_store_name(".", "README") + ".txt")
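
A variation on the same idea, as a minimal sketch: rather than counting substring matches, it tests candidate names against the directory listing until it finds one that is free, so it also handles three or more files with the same date. The function name `next_free_name` and the `name.N.txt` suffix scheme are assumptions chosen for illustration; they are not part of the answer above.

```python
import os

def next_free_name(directory, base, ext=".txt"):
    """Return a path in `directory` that is not in use yet.

    Tries base+ext first, then base.1+ext, base.2+ext, and so on.
    """
    existing = set(os.listdir(directory))  # scan the directory once
    candidate = base + ext
    counter = 0
    while candidate in existing:
        counter += 1
        candidate = "%s.%d%s" % (base, counter, ext)
    return os.path.join(directory, candidate)
```

Because it compares whole filenames, an unrelated file whose name merely contains the date does not affect the result.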

wuliang

(1) Given that all the files are generated by the spider (using this function to allocate names), there is no chance of a conflict. You have to strike a balance between completeness and efficiency. (2) My method checks whether A is anywhere in B, your method checks whether A is at the left side of B, and "f.endswith" checks whether A is at the right side of B. My method is more general and less efficient, but it is hard to say it is wrong. – wuliang Apr 17 '12 at 20:49
What does it matter if it always returns the same name? It is just an index (suffix) added to the original name. Suppose you have downloaded a file named README and call this function to get a new name. The function just checks the directory; if the directory already has README0 and README1, it will return the name README2. – wuliang Apr 18 '12 at 07:54
Yes, you are right. I'll delete my previous comments, as well as this one, in a few. I wasn't familiar with using the "in" keyword to compare strings. – FabienAndre Apr 18 '12 at 08:40
I like this better than mine, but I have some questions. I tried it out in a test directory that had three files named "test.txt", "test.1.txt" and "test.2.txt". Whenever I pass test.txt as the filename (so print get_file_store_name("path", "test.txt")), it returns test.2.txt, when it should be test.3.txt. Basically, what happens if the spider runs across 3+ pages that all need to be saved under the same name, just with a different ending number? – user1074057 Apr 21 '12 at 21:05
And just to add a bit more to the comment above: the script would generate a name like "August 2, 2011" and, using your code, would check that against an existing directory. If it already exists, it would try to name it "August 2, 2011.1". Let's say that's successful, but then it runs across another "August 2, 2011". It would try that name again and find that it exists, but at this point "August 2, 2011.1" already exists too, and so on. That seems to be a problem, or maybe I'm not understanding. – user1074057 Apr 21 '12 at 21:08
This code snippet assumes that all files (filenames) are generated by "get_file_store_name"; the first one will get the name "August 2, 2011.0" in your example. – wuliang Apr 23 '12 at 06:42

steveha · Answer 2 · 2012-04-16T03:20:13.720

The usual way to check for existence of a file in the C library is with a function called stat(). Python offers a thin wrapper around this function in the form of os.stat(). I suggest you use that.

http://docs.python.org/library/stat.html

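import os
import stat

def file_exists(fname):
    try:
        stat_info = os.stat(fname)
        # S_ISREG lives in the stat module and takes the st_mode field
        if stat.S_ISREG(stat_info.st_mode):  # true for regular file
            return True
    except OSError:
        # os.stat raises OSError when the path is missing or inaccessible
        pass
    return False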

score 0 · Accepted Answer · edited May 23 '17 at 11:49

The other answer pointed me in the right direction by suggesting Python's os tools, but I think the approach I found is more straightforward. See "How do I check whether a file exists using Python?" for more.

The following is the code I came up with:

    existence = os.path.isfile(filename)

    if not existence:
        with codecs.open(filename, 'w', encoding='utf-8') as output:
            output.write(content)
    else:
        newFilename = ("/path/.../.../- " + '%s' ".1.txt") % (newDate)
        with codecs.open(newFilename, 'w', encoding='utf-8') as output:
            output.write(content)

Edited to Add:

I didn't like this solution too much, and thought the other answer's solution was probably better but didn't quite work. The main part I didn't like about my solution was that it only worked with 2 files of the same name; if three or four files had the same name the initial problem would occur. The following is what I came up with:

    filename = ("/Users/path/" + " " + "title " + '%s' + " " + "-1.txt") % (date)
    filename = str(filename)

    while True:
        # Strip the extension, then bump the trailing counter
        # (the initial "-1" becomes "0" on the first pass)
        newName = filename.replace(".txt", "")
        newName = newName.split()
        newName[-1] = str(int(newName[-1]) + 1)
        filename = " ".join(newName) + ".txt"
        # Write only once an unused name has been found
        if not os.path.isfile(filename):
            with codecs.open(filename, 'w', encoding='utf-8') as output:
                output.write(texts)
            break

It probably isn't the most elegant and might be kind of a hackish approach, but it has worked so far and seems to have addressed my problem.

This is fine so long as your Python script is the only process creating these files. If you're working in parallel with another program, there's a subtle race condition (http://stackoverflow.com/questions/82831/how-do-i-check-if-a-file-exists-using-python) you may need to account for. – Li-aung Yip, Apr 17 '12 at 07:00
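
One way to sidestep the race mentioned in this comment, as a minimal sketch: ask the operating system to create the file only if the name is free, and move on to the next suffix when it is taken. The helper name `write_unique` and the `name.N.txt` scheme are assumptions for illustration; `os.open` with `os.O_CREAT | os.O_EXCL` is the standard-library call doing the work.

```python
import errno
import os

def write_unique(directory, base, data, ext=".txt"):
    """Create a new file without overwriting an existing one; return its path.

    O_CREAT | O_EXCL makes os.open fail if the name is already taken,
    so the existence check and the creation happen in a single step.
    """
    counter = 0
    while True:
        suffix = "" if counter == 0 else ".%d" % counter
        path = os.path.join(directory, base + suffix + ext)
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except OSError as exc:
            if exc.errno != errno.EEXIST:
                raise  # a real error, not a name clash
            counter += 1  # name taken, try the next suffix
            continue
        with os.fdopen(fd, "wb") as output:
            output.write(data)  # data is expected to be bytes
        return path
```

In the spider, `data` would be something like `content.encode('utf-8')`, since the file is opened in binary mode here.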

score 0 · Answer 4 · answered Apr 17 '12 at 06:55

Another solution is to append the current time to the date when naming the file, like this:

from datetime import datetime

filename = ("/path/to/a/folder/ %s_%s.txt") % (newDate,datetime.now().strftime("%H%M%S"))
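
Note that `%H%M%S` has one-second resolution, so two pages with the same date that are saved within the same second would still collide. A minimal sketch that adds microseconds via `%f` to make that much less likely (though still not impossible); the helper name `timestamped_name` and its `now` parameter are assumptions added for illustration:

```python
from datetime import datetime

def timestamped_name(folder, date_text, now=None):
    """Build a filename from the scraped date plus the current time."""
    if now is None:
        now = datetime.now()
    # %f is the microsecond field, zero-padded to six digits
    stamp = now.strftime("%H%M%S_%f")
    return "%s/%s_%s.txt" % (folder, date_text, stamp)
```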

akhter wahab