
My problem is that I'm not really a programmer; I only build websites and sell them.

I have learned a little Python, but not much, and that is where my problem comes in. I started a program because I want to learn the language on something useful. As I said, I sell websites, and there is a website in my country where almost every company is listed. I want a scraper that collects all the phone numbers on that site.

Currently it only finds the first number, although every page lists ten. Here is my code:

from requests import get


def starting():
    keyword = input("Suchbegriff: ")  # the prompt means "search term"
    URL = "https://www.herold.at/gelbe-seiten/" + keyword + "/"
    print("Targeting... : " + URL)
    data = get(URL)
    print(data.text[:100000000000000000000000])

    tel = data.text.find('"tel:')

    print(tel)
    print(data.text[tel:tel + 19])


starting()

Currently, if I enter an industry name like "friseur" (hairdresser), I get only the first number as output:

"39820 "tel:+4315124367" t"

How can I make the crawler continue and get the other nine?

Thanks in advance for your answers!

Leander
  • `find` is a method of the string class which returns only the first occurrence of what you are looking for – cards Sep 04 '21 at 20:50
  • So what can I do so that it catches all 10? – Leander Sep 04 '21 at 20:53
  • Maybe something like `for line in data.text.split('\n'): if line.find('"tel:') > -1: ...`; otherwise you can use a module such as `bs4` to scrape the content of the page, or use a regex – cards Sep 04 '21 at 20:53
  • Okay, thanks, I'll try bs4 – Leander Sep 04 '21 at 20:59
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Sep 09 '21 at 02:08
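
As a rough illustration of the `find` comment above: `str.find` accepts an optional start index, so it can be called repeatedly to walk through every occurrence. The helper name `find_all_tel` and the sample markup below are made up for this sketch; the real page may be structured differently.

```python
def find_all_tel(html: str, marker: str = '"tel:') -> list[str]:
    """Collect every value that follows `marker` up to the next double quote."""
    numbers = []
    pos = html.find(marker)
    while pos != -1:
        start = pos + len(marker)
        end = html.find('"', start)      # closing quote of the attribute value
        if end == -1:
            break                        # malformed markup, stop searching
        numbers.append(html[start:end])
        pos = html.find(marker, end)     # resume the search after this match
    return numbers


# Hypothetical sample standing in for data.text
sample = '<a href="tel:+4315124367">call</a> <a href="tel:+4319876543">call</a>'
print(find_all_tel(sample))
```

In the original script this would be called as `find_all_tel(data.text)` instead of the single `data.text.find('"tel:')`.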

1 Answer


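To get the numbers you can use the built-in module re (short for regular expressions, or regex). Note that re.search returns only the first match, so use re.findall, which returns every match in the string as a list. No flags are needed here: re.M only changes where ^ and $ match, and this pattern uses neither.

import re
import requests

# example search page from the question (keyword "friseur")
url = "https://www.herold.at/gelbe-seiten/friseur/"
response = requests.get(url)

# every "+" followed by digits, anywhere in the page
tel_nrs = re.findall(r'\+\d+', response.text)

# one number per line
print(*tel_nrs, sep='\n')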

Output

+4315124367
...
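
A pattern that matches any `+` followed by digits can also pick up unrelated text. Since the output in the question shows the numbers in the form `"tel:+4315124367"`, one possible refinement is to anchor the pattern on that `"tel:` prefix and drop duplicates. This is only a sketch: `extract_numbers` and the sample markup are invented here, and the real page may format its links differently.

```python
import re

# A quoted tel: value; the capture group keeps only the number itself
TEL_PATTERN = re.compile(r'"tel:(\+?\d+)"')


def extract_numbers(html):
    """Return the distinct tel: numbers in `html`, in order of first appearance."""
    seen = []
    for number in TEL_PATTERN.findall(html):
        if number not in seen:
            seen.append(number)
    return seen


# Hypothetical sample standing in for response.text
sample = (
    '<a href="tel:+4315124367">+43 1 512 43 67</a>'
    '<span>Rating +5</span>'
    '<a href="tel:+4315124367">call again</a>'
    '<a href="tel:+43199887766">call</a>'
)
print(extract_numbers(sample))
```

With the plain `\+\d+` pattern the same sample would also yield `+43` from the visible text and `+5` from the rating, which is why the prefix is used here.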

Remark: even if you use bs4 you will still face this problem; bs4 is mainly useful for navigating the page.

Combined with bs4, it could look like this:

from bs4 import BeautifulSoup
import re
import requests

# example search page from the question (keyword "friseur")
url = "https://www.herold.at/gelbe-seiten/friseur/"
response = requests.get(url)

# make the response a "navigable" object
soup = BeautifulSoup(response.text, 'lxml')

# regex pattern for the tel nr
n_teL_pattern = re.compile(r'(\+\d+)')

# look for all string in the soup which satisfy the pattern
for s in soup.find_all(string=n_teL_pattern):
    print(n_teL_pattern.search(s).group())   # print the match
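
Searching with `string=` looks at the text nodes of the page. If the numbers only appear inside link attributes such as `href="tel:..."` (which the output in the question suggests, though that is an assumption about the page), reading the attribute directly may be more reliable. The sketch below swaps bs4 for the standard library's `html.parser.HTMLParser` so it runs without extra installs; `TelLinkParser` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TelLinkParser(HTMLParser):
    """Collect the numbers from <a href="tel:..."> links."""

    def __init__(self):
        super().__init__()
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and value.startswith("tel:"):
                self.numbers.append(value[len("tel:"):])


# Hypothetical sample standing in for response.text
sample = (
    '<div><a href="tel:+4315124367">Call</a></div>'
    '<div><a href="/about">About</a></div>'
    '<div><a href="tel:+43199887766">Call</a></div>'
)
parser = TelLinkParser()
parser.feed(sample)
print(parser.numbers)
```

With bs4 the same idea would be to iterate over the `a` tags and inspect each tag's `href` attribute instead of its text.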
cards
  • `lxml` is the parser that I normally use, but it is not part of the standard library. There is one in the standard library, `html.parser`, which needs no extra installation. Check the docs for more details: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser – cards Sep 04 '21 at 21:37