
I have a file of size 10 GB that mostly contains URLs. I am trying to get the HTTP status code of each URL and store the results in another file with a .CSV extension.
I searched around and found a way to get the status code of a single URL using Python:

import requests
request = requests.get('http://www.example.com')
print(request.status_code)

But it takes only one URL at a time, and my file is much larger; I do not know how to feed the URLs from the file into this code, or how to store the output in .CSV format. It is also not fast; I am looking for a solution that will give me results quickly for a 10 GB file.
I also tried this command on Ubuntu:

xargs -n1 -P 10 curl -o /dev/null --silent --head --write-out '%{url_effective},%{http_code}\n' < Input_File.txt > output.CSV

But it is also not multi-threaded: it takes a single line at a time and then writes it to the CSV.
So my question is: how can I make this work faster for a 10 GB file? If there is a solution in any programming language, I will be happy to implement it.
Here is a sample file of URLs - a small chunk from my 10 GB file:
https://drive.google.com/file/d/0BzQ6rtO2VN95c0YzclhySVZYNDQ/view?usp=sharing
I want to store the output in CSV as:

URL,Http Status code

For example:

http://google.com,200  
http://example.com,503  

I hope this helps to clarify my query.

Jaffer Wilson
  • If your machine is powerful enough, try multi-processing in Python: set up a shared queue of URLs, then start N processes that pull URLs from that queue (see the sketch after these comments). – Acepcs Jan 05 '17 at 08:03
  • @Acepcs Thank you for your advice. Still, my friend, the file size is 10 GB, or sometimes even more. How many processes and how many splits do you think I should use? I am looking for an all-in-one solution that can handle any file size quickly; I do not want the program to take days to get the work done. A few hours is what I want. – Jaffer Wilson Jan 05 '17 at 08:06
  • [This might supply some pointers](https://stackoverflow.com/questions/6441509/how-to-write-a-process-pool-bash-shell) if you don't want to dabble in Python. – Ken Y-N Jan 05 '17 at 08:07
  • @Jaffer Wilson 10 GB is a big number. If this is a one-shot mission, you can divide the whole file into several parts and run multi-processing Python scripts on several computers; I am sure the time cost would be acceptable. But if it is a long-term mission, my suggestion would be a distributed system. – Acepcs Jan 05 '17 at 08:13
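
A rough sketch of the queue-of-URLs idea from the first comment, using Python's standard multiprocessing module (the process count, timeout, and file names below are assumptions, not part of the question):

import csv
import requests
from multiprocessing import Process, Queue, cpu_count

def worker(url_queue, result_queue):
    # Pull URLs from the shared queue until a None sentinel arrives.
    while True:
        url = url_queue.get()
        if url is None:
            break
        try:
            status = requests.head(url, timeout=10, allow_redirects=True).status_code
        except requests.RequestException:
            status = 'error'  # DNS failure, refused connection, timeout, ...
        result_queue.put((url, status))

if __name__ == '__main__':
    n_procs = cpu_count() * 4  # arbitrary; network-bound work tolerates oversubscription
    url_queue, result_queue = Queue(), Queue()

    procs = [Process(target=worker, args=(url_queue, result_queue)) for _ in range(n_procs)]
    for p in procs:
        p.start()

    n_urls = 0
    with open('Input_File.txt') as f:  # one URL per line assumed
        for line in f:
            url = line.strip()
            if url:
                url_queue.put(url)
                n_urls += 1
    for _ in procs:  # one sentinel per worker so every process eventually exits
        url_queue.put(None)

    with open('output.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(['URL', 'Http Status code'])
        for _ in range(n_urls):
            writer.writerow(result_queue.get())

    for p in procs:
        p.join()

For a 10 GB input you would feed and drain the queues in batches rather than enqueue everything up front, but the structure stays the same.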

1 Answer


What curl can do, python requests can often do, and do better. Like curl, it also has a HEAD method:

import requests
response = requests.head('http://www.example.com')
print(response.status_code)
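
To read the URLs from a file and write a URL,status CSV, a minimal sequential sketch could look like the following (the file names are placeholders and one URL per line is assumed):

import csv
import requests

with open('Input_File.txt') as urls, open('output.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['URL', 'Http Status code'])
    for line in urls:
        url = line.strip()
        if not url:
            continue
        try:
            # HEAD fetches only the headers; a timeout keeps dead hosts from blocking the loop.
            status = requests.head(url, timeout=10, allow_redirects=True).status_code
        except requests.RequestException:
            status = 'error'  # DNS failure, refused connection, timeout, ...
        writer.writerow([url, status])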
e4c5
  • Yes, you are right, but the thing is I do not know how to pass multiple URLs at a time for processing and collect the output, whether with Python or with the curl command on Ubuntu. I have tried many things but always hit the same problem: a single URL at a time. I want to feed multiple URLs, say 100 at a time; my file size is 10 GB... – Jaffer Wilson Jan 05 '17 at 08:03
  • Using multiple URLs at a time will be faster only if they are on the same server; otherwise the connection-establishment overhead would still be there, and after processing all of your 10 GB you would probably save about 30 seconds. – e4c5 Jan 05 '17 at 08:05
  • I have a bunch of URLs and just need to check whether they are alive or not. They are not on the same server; they are from different locations as well. – Jaffer Wilson Jan 05 '17 at 08:09
  • Then, as in my previous comment, sending a bunch of URLs to python-requests or curl will not make any difference. – e4c5 Jan 05 '17 at 08:12
  • What you really need is a distributed system at best, or at least multiple threads making requests in parallel; a rough sketch of the threaded approach follows below. – e4c5 Jan 05 '17 at 08:58
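
A minimal sketch of that threaded approach, using concurrent.futures (the pool size, timeout, and file names are assumptions to tune for your machine and network, not part of the answer above):

import csv
import requests
from concurrent.futures import ThreadPoolExecutor

def check(url):
    # Return (url, status code or 'error') for a single URL.
    try:
        return url, requests.head(url, timeout=10, allow_redirects=True).status_code
    except requests.RequestException:
        return url, 'error'

# NOTE: this loads every URL into memory; for a 10 GB file, process it in chunks instead.
with open('Input_File.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

with ThreadPoolExecutor(max_workers=50) as pool, \
        open('output.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['URL', 'Http Status code'])
    for url, status in pool.map(check, urls):
        writer.writerow([url, status])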