I am parsing local XML files with Scrapy, and the code seems to hang on one particular XML file. The file may be too large (219M) or badly formatted; either way, the spider doesn't crash, it just freezes. It freezes so badly that I can't even Ctrl+C out of it...
I have tried adjusting the `DOWNLOAD_TIMEOUT` and `DOWNLOAD_MAXSIZE` settings to get Scrapy to skip this file (and any other similarly problematic files it encounters), but that doesn't seem to work, at least not when I use `file:///Users/.../myfile.xml` as the URL, which I am doing based on this post.
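For reference, this is roughly how I have those settings; the exact numbers are just what I've been experimenting with:

```python
# settings.py (or custom_settings on the spider) -- values are illustrative
DOWNLOAD_TIMEOUT = 60                   # seconds before giving up on a response
DOWNLOAD_MAXSIZE = 100 * 1024 * 1024    # 100 MB cap, well under the 219 MB file
```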
If I instead start a server with `python -m http.server 8002` and access the files through that URL (`http://localhost:8002/.../myfile.xml`), then Scrapy does skip over the file with a `CancelledError`, like I want: `expected response size larger than download max size`.
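To illustrate, the only thing I'm changing between the two runs is the URL scheme (the spider name and paths below are placeholders, not my real code):

```python
import scrapy


class XmlFilesSpider(scrapy.Spider):
    name = "xmlfiles"  # placeholder name

    # Variant 1: file:// URLs -- hangs on the 219 MB file
    # start_urls = ["file:///Users/.../myfile.xml"]

    # Variant 2: same files served via `python -m http.server 8002`
    # -- the big file gets cancelled as expected
    start_urls = ["http://localhost:8002/.../myfile.xml"]

    def parse(self, response):
        # XML parsing happens here; omitted since it isn't relevant to the hang
        ...
```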
So I'm guessing that when you use the file protocol the downloader settings aren't applied, because you're not actually downloading anything? Something like that? Is there a way to tell Scrapy to time out on, or skip over, local files?
Launching an HTTP server seems to be one solution, but it adds complexity to running the spider (and may slow things down?), so I'd rather find a different approach.
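One fallback I've been considering (untested) is to check the file sizes myself in `start_requests` and skip anything over a threshold, instead of relying on `DOWNLOAD_MAXSIZE` -- though I'd prefer it if Scrapy could handle this. Rough sketch, where the threshold and URL list are placeholders:

```python
import os
from urllib.parse import urlparse
from urllib.request import url2pathname

import scrapy

MAX_BYTES = 100 * 1024 * 1024  # placeholder threshold, same ceiling I wanted DOWNLOAD_MAXSIZE to enforce


class XmlFilesSpider(scrapy.Spider):
    name = "xmlfiles"  # placeholder name
    start_urls = ["file:///Users/.../myfile.xml"]  # plus the rest of the local files

    def start_requests(self):
        for url in self.start_urls:
            # For file:// URLs, stat the file directly and skip anything too big,
            # since DOWNLOAD_MAXSIZE doesn't seem to apply to them.
            path = url2pathname(urlparse(url).path)
            if os.path.getsize(path) > MAX_BYTES:
                self.logger.warning("Skipping %s: larger than %d bytes", url, MAX_BYTES)
                continue
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        ...
```

That would handle the size problem, but not a file that makes the parser itself hang, so I'm still looking for a proper way to make Scrapy time out or skip these.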