
I have a CSV file with 900 rows of URLs, and I want to detect which of them are broken (404 or otherwise dead) before I use the data with scikit-learn. Is there a way to do this with Python 3.7 and generate a CSV file that says which links are dead and which are active?

I hope someone can help me with this. Thanks in advance.

Mostafa Gafer
  • Please edit your post with your program, your exact issue, what you have tried, and why it has not worked to your standards. – miike3459 Oct 11 '18 at 23:14
  • Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. [On topic](http://stackoverflow.com/help/on-topic), [how to ask](http://stackoverflow.com/help/how-to-ask), and [... the perfect question](https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) apply here. StackOverflow is not a design, coding, research, or tutorial resource. – Prune Oct 11 '18 at 23:18

1 Answer

You first need to decide exactly what counts as a broken link. Here is some sample code; you can adapt it to your needs by changing the is_broken function:

 import pandas as pd
 import requests

 # Preparing dummy data
 links = ['https://google.com', 'http://thisisinvalid.de', 'http://docs.python-requests.org/en/master/api/broken']
 df = pd.DataFrame(links, columns=['links'])

 # Update as you need
 def is_broken(link):
     try:
         # The timeout keeps one unresponsive host from stalling the whole run
         response = requests.get(link, timeout=10)
         return response.status_code == 404
     except requests.exceptions.RequestException:
         # DNS failures, refused connections, timeouts, malformed URLs
         return True

 # .ix was removed in pandas 1.0; plain column access does the same job
 df['is_broken'] = df['links'].map(is_broken)

Here https://google.com is not broken, http://thisisinvalid.de cannot be resolved, and http://docs.python-requests.org/en/master/api/broken returns 404, so the last two are flagged as broken.
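Since the question asks for a CSV file as output, here is a minimal sketch of reading the links from one CSV and writing a labelled copy. It uses only the standard library (csv and urllib) instead of pandas and requests, so it is a swapped-in variant and not the code above. The function name label_links, the column name "links", and the file names in the usage note are assumptions for illustration; adjust them to your file.

```python
import csv
import urllib.error
import urllib.request


def is_broken(link, timeout=10):
    """Return True if the link answers 404 or cannot be reached at all."""
    try:
        with urllib.request.urlopen(link, timeout=timeout):
            return False
    except urllib.error.HTTPError as error:
        # The server answered; only a 404 counts as broken here
        return error.code == 404
    except Exception:
        # DNS failure, refused connection, timeout, malformed URL, ...
        return True


def label_links(in_path, out_path, column="links", checker=is_broken):
    """Copy the CSV at in_path to out_path, adding an is_broken column."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["is_broken"])
        writer.writeheader()
        for row in reader:
            row["is_broken"] = checker(row[column])
            writer.writerow(row)
```

Calling label_links("links.csv", "links_checked.csv") would then produce a copy of the input with True or False in the new column. Passing the checker as a parameter makes it easy to swap in a stricter definition of "broken" later.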

maininformer
  • Thanks for the fast response, I truly appreciate it. Should it take this long to run, though? It is taking a very long time to process my data. – Mostafa Gafer Oct 12 '18 at 00:50
  • Yes, it makes 900 blocking network calls, one after another. You could try asynchronous calls; here is an example: https://stackoverflow.com/questions/9110593/asynchronous-requests-with-python-requests – maininformer Oct 12 '18 at 00:58
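
One way to overlap the blocking calls mentioned in the last comment is a thread pool from the standard library. This is a sketch of that alternative, not the approach from the linked question; the helper name check_all and the default of 20 workers are assumptions, and checker stands for whatever function decides if a single link is broken (for example is_broken from the answer).

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(links, checker, max_workers=20):
    """Run checker over links concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input, so the results
        # line up with the original rows
        return list(pool.map(checker, links))
```

With the DataFrame from the answer, this could be used as df['is_broken'] = check_all(df['links'], is_broken). Threads suit this job because each call spends most of its time waiting on the network.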