I need to remove the URLs that return a 404 status from a file, using Python's list remove method, but I am not sure why it is not working.

Code:

#!/usr/bin/python

import requests



url_lines = open('url.txt').read().splitlines()
for url in url_lines:
    remove_url = requests.get(url)
    if remove_url.status_code == 404:
        print remove_url.status_code
        url_lines.remove(url)

The url.txt file contains the following lines:

https://www.amazon.co.uk/jksdkkhsdhk
http://www.google.com

The line https://www.amazon.co.uk/jksdkkhsdhk should be removed from the url.txt file.

Thank you so much in advance for your help.

rmstmg
  • Maybe this will help: https://stackoverflow.com/questions/6022764/python-removing-list-element-while-iterating-over-list – D. Seah May 20 '20 at 04:07
  • I think once the status code has been checked, you can skip the status code check line afterwards and remove the URL with `url_lines.remove(remove_url)`, provided `url_lines` is a list. – de_classified May 20 '20 at 04:09
  • If you want the 404-urls to be dropped/removed from the `txt` file, you will need to write the updated list of valid urls to the file. – CypherX May 20 '20 at 04:15

1 Answer

You could just skip it:

if remove_url.status_code == 404:
    continue
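
As a side note on why the loop in the question behaves oddly, here is a small sketch that needs no network access. The URLs and status codes in `fake_status` are made up for illustration. It shows that calling `remove` on a list while a for loop is iterating over it makes the loop skip the element that follows each removal:

```python
# Hypothetical status codes standing in for real HTTP responses.
fake_status = {
    "http://a.example/missing-1": 404,
    "http://a.example/missing-2": 404,
    "http://a.example/ok": 200,
}

urls = list(fake_status)
visited = []

for url in urls:
    visited.append(url)
    if fake_status[url] == 404:
        urls.remove(url)  # mutates the list that the loop is iterating over

print(visited)  # the second 404 URL is never visited
print(urls)     # so it is still in the list afterwards
```

After the first removal the remaining items shift one position to the left, while the loop's internal index still moves forward, so `missing-2` is never checked.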

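You shouldn't remove items from a list while a for loop is iterating over it: each removal shifts the remaining items, so the loop skips the one that follows. Instead, collect the URLs to drop in a second list, remove_from_urls, and after the for loop rebuild url_lines without them. This could be done by: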

remove_from_urls = []

for url in url_lines:
    remove_url = requests.get(url)
    if remove_url.status_code == 404:
        remove_from_urls.append(url)
        continue
    # Code for handling non-404 requests

url_lines = [url for url in url_lines if url not in remove_from_urls]

# Save the remaining URLs back to the file the question reads from (url.txt)
with open('url.txt', 'w') as file:
    for item in url_lines:
        file.write(item + '\n')
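
The steps above (read the file, filter without mutating the list during iteration, write the result back) can also be wrapped in one function. This is only a sketch under assumptions: `drop_404_urls` and its `get_status` parameter are hypothetical names, and the demo uses made-up status codes and a temporary file so that it runs without network access. With requests, `get_status` could be `lambda u: requests.get(u).status_code`.

```python
import os
import tempfile


def drop_404_urls(path, get_status):
    """Rewrite the file at `path`, keeping only URLs whose status is not 404.

    `get_status` is a hypothetical callable mapping a URL to an HTTP status code.
    """
    with open(path) as f:
        urls = f.read().splitlines()

    # Build a new list instead of removing from `urls` while looping over it.
    kept = [url for url in urls if get_status(url) != 404]

    # Overwrite the same file with the URLs that survived the check.
    with open(path, "w") as f:
        for url in kept:
            f.write(url + "\n")
    return kept


# Demo with made-up status codes (assumption: no real requests are sent).
fake_status = {
    "https://www.amazon.co.uk/jksdkkhsdhk": 404,
    "http://www.google.com": 200,
}

demo_path = os.path.join(tempfile.mkdtemp(), "url.txt")
with open(demo_path, "w") as f:
    f.write("\n".join(fake_status) + "\n")

kept = drop_404_urls(demo_path, fake_status.get)

with open(demo_path) as f:
    remaining = f.read().splitlines()
print(remaining)
```

Passing the status lookup in as a parameter keeps the file handling testable without touching the network. In real use, connection errors and timeouts from the HTTP call would still need handling.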
beez
bitomic
  • Thanks for your help. I tried running your code, but it had no effect: I still see the 404 URL in the url.txt file. I think the last line of your code, which is outside the for loop, should update the url.txt file, right? – rmstmg May 20 '20 at 04:27
  • @rmstmg you are right, I didn't update the file. You just need to store each item in `url_lines`. Will update the code :-) – bitomic May 20 '20 at 18:44