
I could think of only one way of solving this problem, but it has some limitations (listed below). Can somebody suggest another way of solving it?

We are given a text file with 999999 URLs. We have to write a Python program to read this file and save all the web pages in a folder called 'saved_page'.

I have tried to solve this problem like this:

import os
import urllib

save_path = 'C:/test/home_page/'
Name = os.path.join(save_path, "test.txt")

# all the urls are in the soop.txt file
url_file = open('soop.txt', 'r')
for line in url_file:
    url = line.strip()
    data = urllib.urlopen(url)     # Python 2 urllib
    out = open(Name, 'a')          # append each downloaded page to the output file
    out.write(data.read())
    out.close()
url_file.close()

Here are some limitations of this code:

1). If the network goes down, the whole download has to be restarted from the beginning.

2). If it comes across a bad URL - i.e. the server doesn't respond - this code will get stuck.

3). I am currently downloading in sequence - this will be quite slow for a large number of URLs.

So can somebody suggest a solution that would address these problems as well?

Bhupesh Pant
  • One thought: you can save the content in different files, so that when the program restarts you can check the existence of a file to determine whether to fetch the page again or not (see the sketch after these comments). – nu11p01n73R Oct 15 '14 at 04:06
  • OK, I can do that, but isn't that an overhead? I am thinking of maintaining a list of all the successfully downloaded URLs and saving it to a file. Do you have any opinion on that? – Bhupesh Pant Oct 15 '14 at 04:20
  • If you are not interested in the content after that, maintaining a list would be the better option. – nu11p01n73R Oct 15 '14 at 04:24
  • Do these URLs point to the same site? Here's an example of [how to make a limited number of ssl connections at a time](http://stackoverflow.com/a/20722204/4279). – jfs Oct 15 '14 at 05:20
  • Have you looked at Scrapy item pipelines? You could dump all the content into a JSON file (or files), e.g. http://doc.scrapy.org/en/latest/topics/item-pipeline.html. Here's a similar example based on some recreational coding I did a couple of weeks ago: https://gist.github.com/alexwoolford/996f186c539f05ce1589 – Alex Woolford Oct 15 '14 at 06:05
  • They could be different, @J.F.Sebastian – Bhupesh Pant Oct 17 '14 at 10:07
  • @AlexWoolford Item pipelines seem more like a web crawler, and I am basically looking for a simpler solution. – Bhupesh Pant Oct 17 '14 at 10:16
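
A small sketch of the idea from the first comment, saving each page to its own file in the saved_page folder and skipping URLs whose file already exists; the md5-based naming scheme is just an illustrative choice, not part of the original suggestion:

import hashlib
import os
import urllib

save_dir = 'saved_page'
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

for line in open('soop.txt', 'r'):
    url = line.strip()
    # derive a file name from the URL (md5 keeps it short and unique enough)
    name = os.path.join(save_dir, hashlib.md5(url).hexdigest() + '.html')
    if os.path.exists(name):
        continue                       # already fetched in a previous run
    page = urllib.urlopen(url).read()
    out = open(name, 'w')
    out.write(page)
    out.close()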

1 Answer


Some remarks:

Points 1 and 2 can easily be fixed with a restart-point method. For an in-script retry, just add a loop under the for line in file line that repeats the download until it succeeds or a maximum number of attempts is reached, and only write if you could successfully download the page. You will still have to decide what to do when a page cannot be downloaded: either log an error and continue with the next URL, or abort the whole job.
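
For example, a minimal retry sketch (Python 2, as in the question; MAX_ATTEMPTS, the 30-second timeout and the skip-on-failure policy are my own illustrative choices):

import socket
import urllib

socket.setdefaulttimeout(30)            # so an unresponsive server cannot block forever
MAX_ATTEMPTS = 3                        # arbitrary retry limit
Name = 'C:/test/home_page/test.txt'     # output file, as in the question

for line in open('soop.txt', 'r'):
    url = line.strip()
    page = None
    for attempt in range(MAX_ATTEMPTS):
        try:
            page = urllib.urlopen(url).read()
            break                       # success, stop retrying
        except IOError:
            pass                        # network/server error, try again
    if page is None:
        print 'could not download', url # log and continue with the next URL
        continue
    out = open(Name, 'a')
    out.write(page)
    out.close()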

If you want to be able to restart a failed job later, you should keep the list of successfully downloaded files somewhere (a state.txt file, say). Write (and flush) to it after each file is fetched and written. But to be really bulletproof, you should write one entry after getting the file and another entry after successfully writing it. That way, on restart, you can tell whether the output file may contain a partially written page (power outage, break, ...) simply by testing the presence of the state file and its content.
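
A rough sketch of that state-file idea; the file name state.txt and the GOT/WROTE markers are illustrative choices, not a fixed format:

import urllib

Name = 'C:/test/home_page/test.txt'     # output file, as in the question

# load the URLs that were fully processed in a previous run
done = set()
try:
    for entry in open('state.txt', 'r'):
        marker, url = entry.strip().split(' ', 1)
        if marker == 'WROTE':           # only fully written pages count
            done.add(url)
except IOError:
    pass                                # no state file yet: first run

state = open('state.txt', 'a')
for line in open('soop.txt', 'r'):
    url = line.strip()
    if url in done:
        continue                        # already downloaded and written
    page = urllib.urlopen(url).read()
    state.write('GOT %s\n' % url)       # first marker: download finished
    state.flush()
    out = open(Name, 'a')
    out.write(page)
    out.close()
    state.write('WROTE %s\n' % url)     # second marker: write finished
    state.flush()
state.close()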

Point 3 would be much trickier. To allow parallel downloads, you will have to use threads or asyncio. But you will also have to synchronize all of that to ensure the pages are written to the output file in the proper order. If you can afford to keep everything in memory, a simple way would be to first download everything with a parallelized method (the link given by J.F. Sebastian can help), and then write in order.
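
One possible sketch of that "download in parallel, then write in order" variant, using a thread pool from multiprocessing.dummy; the pool size of 20 is arbitrary, and this only works if all pages fit in memory:

import urllib
from multiprocessing.dummy import Pool  # thread pool behind the Pool API

Name = 'C:/test/home_page/test.txt'     # output file, as in the question

def fetch(url):
    try:
        return urllib.urlopen(url).read()
    except IOError:
        return None                     # mark failed downloads

urls = [line.strip() for line in open('soop.txt', 'r')]
pool = Pool(20)                         # 20 worker threads
pages = pool.map(fetch, urls)           # results come back in input order
pool.close()
pool.join()

out = open(Name, 'w')
for page in pages:
    if page is not None:
        out.write(page)                 # written in the original URL order
out.close()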

Serge Ballesta