I need my code to switch to a new link on each pass so that it can parse and print the titles of a list of articles.

How do I loop part of the code so that it iterates over the values from 1 to 10, switching to a new link each time?

import urllib.request
from bs4 import BeautifulSoup
import re

for n in range(1,11):
        url = f"https://habr.com/ru/articles/{n}/"

fp = urllib.request.urlopen(url)
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
def title(input_string):
    
        pattern = r'<title>(.+?)</title>'
        
        match = re.search(pattern, input_string)
        if match:
            return match.group(1)
        else:
            return None
print(title(mystr))

3 Answers

I made two changes to your code:

  1. Moved the request and parsing code back inside your for loop so that it runs once for each URL
  2. Added some error handling (try, except, finally) around the URL request to deal with page-not-found errors (and anything else that may go wrong; it catches any error and skips that URL)

Give this a try:

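import urllib.request
import re

def title(input_string):
    # Return the text of the first <title> element, or None if there is none
    pattern = r'<title>(.+?)</title>'

    match = re.search(pattern, input_string)
    if match:
        return match.group(1)
    else:
        return None

for n in range(1, 11):
    url = f"https://habr.com/ru/articles/{n}/"
    fp = None  # reset so that finally never closes a stale or undefined handle

    try:
        fp = urllib.request.urlopen(url)
        mybytes = fp.read()
        mystr = mybytes.decode("utf8")
        print(title(mystr))
    except Exception:
        # Skip pages that fail to load or decode
        pass
    finally:
        # Close the response only if the request actually opened one
        if fp is not None:
            fp.close()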
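
If catching every exception feels too broad, a narrower variant is sketched below. It assumes that only urllib.error.HTTPError (for example a 404) and urllib.error.URLError (the server cannot be reached) should be skipped. The names fetch_html and opener are hypothetical, introduced here so that the function can be exercised with a stub instead of a real network call.

```python
import urllib.error
import urllib.request


def fetch_html(url, opener=urllib.request.urlopen):
    """Return the decoded page body, or None if the request fails.

    `opener` is a hypothetical hook (not in the original code) that
    defaults to urllib.request.urlopen and can be swapped for a stub.
    """
    try:
        # The with-block closes the response even if read() raises
        with opener(url) as fp:
            return fp.read().decode("utf8")
    except urllib.error.HTTPError as err:
        # The server answered with an error status such as 404
        print(f"skipping {url}: HTTP {err.code}")
    except urllib.error.URLError as err:
        # The server could not be reached at all
        print(f"skipping {url}: {err.reason}")
    return None
```

Inside the loop, the result could then be checked with something like `page = fetch_html(url)` followed by `if page is not None: print(title(page))`. Other failures, such as a page that is not valid UTF-8, are deliberately left uncaught in this sketch so that they stay visible.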
Pep_8_Guardiola

Consider pushing most of your code into the title function so that the function performs the URL request as well as the pattern search. Then make url the parameter of that function. Lastly, call the function in a loop.

import urllib.request
import re

def title(url):
    # Fetch the page and decode it to text
    fp = urllib.request.urlopen(url)
    mybytes = fp.read()
    mystr = mybytes.decode("utf8")
    fp.close()

    # Search the downloaded page for the first <title> element
    pattern = r'<title>(.+?)</title>'

    match = re.search(pattern, mystr)
    if match:
        return match.group(1)
    else:
        return None

for n in range(1, 11):
    url = f"https://habr.com/ru/articles/{n}/"
    print(title(url))
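
One caveat about the pattern itself: <title>(.+?)</title> only matches when the tag is lowercase, has no attributes, and sits on a single line. A slightly more tolerant version is sketched below. The name extract_title and the decision to unescape HTML entities are assumptions made for illustration, not part of the code above.

```python
import html
import re

# DOTALL lets "." span newlines, IGNORECASE accepts <TITLE>,
# and [^>]* tolerates attributes inside the opening tag
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(page):
    """Return the text of the first <title> element, or None if absent."""
    match = TITLE_RE.search(page)
    if match is None:
        return None
    # Turn entities such as &amp; into characters and trim whitespace
    return html.unescape(match.group(1)).strip()
```

This is still a regular expression rather than a real HTML parser, so it may misbehave on unusual markup.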
JNevill

You can save the variable part of the link (in this case n) in a list.

import urllib.request
import re

def title(input_string):
    pattern = r'<title>(.+?)</title>'

    match = re.search(pattern, input_string)
    if match:
        return match.group(1)
    else:
        return None

# Placeholder list of the variable URL parts
url_list = ['option1', 'option2', ...]

# range(len(url_list)) yields every valid index, starting at 0
for i in range(len(url_list)):
    url = f"https://habr.com/ru/articles/{url_list[i]}/"
    fp = urllib.request.urlopen(url)
    mybytes = fp.read()
    mystr = mybytes.decode("utf8")
    fp.close()

    print(title(mystr))

In this case, you should move both the urllib request and the print statement inside the loop so that they run for every variation of the URL. By defining your title function outside the loop, you also avoid redefining it on every iteration.

Also, keep in mind that every list is indexed starting at 0, so if you keep the range at range(1, 11) you will skip the first element of the list, and you will get an IndexError because a list with 10 elements has no index 10.
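
The indexing point can be checked with a small stand-in list. In the sketch below, options is a hypothetical ten-element list used only for illustration. Iterating over the list directly avoids the index arithmetic altogether.

```python
# Hypothetical stand-in for the list of variable URL parts
options = [f"option{k}" for k in range(1, 11)]

# range(len(options)) yields 0..9, which covers every element
by_index = [options[i] for i in range(len(options))]

# range(1, 11) skips index 0 and then asks for index 10, which does not exist
try:
    shifted = [options[i] for i in range(1, 11)]
except IndexError:
    shifted = None

# Iterating over the list itself needs no indices at all
urls = [f"https://habr.com/ru/articles/{part}/" for part in options]
```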