
I am parsing local XML files with Scrapy, and the code seems to hang on one particular XML file.

The file may be too large (219M) or badly formatted. Either way, the spider doesn't crash; it just freezes, so badly that I can't even Ctrl+C out of it.

I have tried adjusting the DOWNLOAD_TIMEOUT and DOWNLOAD_MAXSIZE settings to get Scrapy to skip this file, and any other similarly problematic files it encounters, but it doesn't seem to work. At least, not if I use file:///Users/.../myfile.xml as the URL, which I am doing based on this post.
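For concreteness, the spider's custom_settings looks something like this (DOWNLOAD_TIMEOUT and DOWNLOAD_MAXSIZE are the real Scrapy setting names; the particular values here are just examples):

```python
# Illustrative values; DOWNLOAD_TIMEOUT and DOWNLOAD_MAXSIZE are real
# Scrapy setting names, applied per-spider via the custom_settings dict
custom_settings = {
    "DOWNLOAD_TIMEOUT": 30,                 # seconds before a download times out
    "DOWNLOAD_MAXSIZE": 50 * 1024 * 1024,   # abort responses larger than 50 MB
}
```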

If I instead start a server with python -m http.server 8002 and access the files through that URL (http://localhost:8002/.../myfile.xml), then Scrapy does skip over the file with a CancelledError, like I want: expected response size larger than download max size.

So I guess that if you use the file protocol, the downloader settings aren't applied, because you're not actually downloading anything? Something like that? Is there a way to tell Scrapy to time out on, or skip over, local files?

It seems like launching an HTTP server is one solution, but it adds complexity to running the spider (and may slow things down?), so I'd rather find a different solution.

Dustin Michels

1 Answer


I'm fairly certain that DOWNLOAD_TIMEOUT and DOWNLOAD_MAXSIZE only apply when making requests over HTTP or another network protocol. Instead, you could override the start_requests method, where you have more control over how the files are read:

def start_requests(self):
    for uri in self.uris:
        ...

You could, for example, use os.read and pass its length argument, which tells Python to return once at most that many bytes have been read. This would have roughly the same effect as DOWNLOAD_MAXSIZE.
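A minimal sketch of that idea (read_capped, MAX_BYTES, and the 50 MB value are illustrative names, not Scrapy API): read at most the cap, then probe one more byte to see whether the file exceeded it. In start_requests you could call this for each local path and only yield a request for files that weren't truncated:

```python
import os

MAX_BYTES = 50 * 1024 * 1024  # assumed cap, analogous to DOWNLOAD_MAXSIZE


def read_capped(path, max_bytes=MAX_BYTES):
    """Read at most max_bytes from path; return (data, truncated)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # os.read stops after max_bytes, so a huge file can't hang us here
        data = os.read(fd, max_bytes)
        # If another read still yields bytes, the file was larger than the cap
        truncated = bool(os.read(fd, 1))
    finally:
        os.close(fd)
    return data, truncated
```

In the spider you would then do something like `data, truncated = read_capped(uri)` and `continue` past any file where truncated is True, instead of letting Scrapy's file handler read it whole.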

Krisz
  • Nice point, thank you. Do you suspect the size is the issue? Or could there be another reason Scrapy freezes on certain XML files? – Dustin Michels Mar 19 '21 at 09:46
  • Given that you already tried it via HTTP and got the error `expected response size larger than download max size`, I suspect it's the file size, yes. Note that reading a file is a blocking operation, and Scrapy's file:// handler reads the whole file in one go before it returns, which might explain why it freezes. – Krisz Mar 19 '21 at 09:50