I am working on a web scraping project and have run into the following error:

requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

Below is my code. I retrieve all of the links from the HTML table and they print out as expected, but when I try to loop through them (`links`) with `requests.get` I get the error above.

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'html.parser')
        table = []
        # Find all the divs we need in one go.
        divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
        for div in divs:
            # find all the enclosing a tags.
            anchors = div.find_all('a')
            for anchor in anchors:
                # Now we have groups of 3 list items (li) tags
                lis = anchor.find_all('li')
                # we clean up the text from the group of 3 li tags and add them as a list to our table list.
                table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
        # We have all the data so we add it to a DataFrame.
        headers = ['Number', 'Tenant', 'Square Footage']
        df = DataFrame(table, columns=headers)
        print (df)
  • Always put the full error message (traceback) in the question, as text, not a screenshot. It contains other useful information; for example, it shows which line causes the problem. – furas Dec 20 '17 at 03:32
  • Your mistake is the double `for` loop - use `print` to display the values in your variables and you will see the mistake you made. – furas Dec 20 '17 at 03:34
  • As per my understanding there isn't any error; they are just not getting what they want, to be exact, right? – P.hunter Dec 20 '17 at 03:34
  • @P.hunter The question indicates `requests.exceptions.MissingSchema`. – Galen Dec 20 '17 at 03:35
  • Yeah, I got it, thanks. – P.hunter Dec 20 '17 at 03:36
  • BTW: pandas can read tables directly from web pages: `all_tables = pandas.read_html(url); df = all_tables[0]` – furas Dec 20 '17 at 03:53
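
A minimal sketch of the `pandas.read_html` approach furas mentions above, assuming the listing table is the first `<table>` on the page:

import pandas as pd

url = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"

# read_html parses every <table> on the page into a list of DataFrames
all_tables = pd.read_html(url)
df = all_tables[0]  # assumption: the first table is the one of interest
print(df.head())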

1 Answer


Your mistake is the second `for` loop in the code:

for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:

`ref['href']` gives you a single URL, but the next `for` loop treats it as a list.

So effectively you have

for link in ref['href']:

and iterating over a string yields its characters, so the first thing it gives you is the first character of the URL http://properties.kimcore..., which is h.
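
A minimal sketch of that behavior:

url = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"

# iterating over a string yields one character per pass,
# so requests.get() is called with just 'h' on the first pass
for link in url:
    print(link)  # 'h', then 't', then 't', ...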

Full working code

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    link = ref['href']
    print(link)
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = []
    # Find all the divs we need in one go.
    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        # find all the enclosing a tags.
        anchors = div.find_all('a')
        for anchor in anchors:
            # Now we have groups of 3 list items (li) tags
            lis = anchor.find_all('li')
            # we clean up the text from the group of 3 li tags and add them as a list to our table list.
            table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
    # We have all the data so we add it to a DataFrame.
    headers = ['Number', 'Tenant', 'Square Footage']
    df = DataFrame(table, columns=headers)
    print(df)
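
A side note on the `unicodedata.normalize("NFKD", ...)` call in the code above: scraped text often contains non-breaking spaces (`\xa0`), and NFKD normalization turns them into plain spaces so `.strip()` works as expected. A minimal sketch with a made-up string:

import unicodedata

raw = "1,200\xa0sq ft"  # hypothetical scraped text containing a non-breaking space
clean = unicodedata.normalize("NFKD", raw).strip()
print(clean)  # '1,200 sq ft' - the \xa0 becomes a plain space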

BTW: if you add a trailing comma, as in `(ref['href'], )`, then you get a tuple, and the second `for` loop works correctly.
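
A minimal sketch of the difference, using a plain string in place of `ref['href']`:

url = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"

links = (url)    # no comma: just the string in parentheses
links = (url,)   # trailing comma: a one-element tuple

for link in links:
    print(link)  # prints the whole URL once, not one character at a time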


EDIT: the version below creates the list `table_data` at the start, adds all the data to this list, and converts it into a DataFrame at the end.

But now I see it reads the same page a few times, because the same URL appears in every column of a row. You have to get the URL from only one column.

EDIT: now it doesn't read the same URL many times.

EDIT: now it gets the text and `href` from the first link in each row and adds them to every element appended to the list.

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table_data = []

# all rows in the table except the first one ([1:]) - the header row
rows = soup.select('table tr')[1:]
for row in rows:

    # link in the first column (td[0])
    #link = row.select('td')[0].find('a')
    link = row.find('a')

    link_href = link['href']
    link_text = link.text

    print('text:', link_text)
    print('href:', link_href)

    page = requests.get(link_href)
    soup = BeautifulSoup(page.content, 'html.parser')

    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        anchors = div.find_all('a')
        for anchor in anchors:
            lis = anchor.find_all('li')
            item1 = unicodedata.normalize("NFKD", lis[0].text).strip()
            item2 = lis[1].text
            item3 = lis[2].text.strip()
            table_data.append([item1, item2, item3, link_text, link_href])

    print('table_data size:', len(table_data))            

headers = ['Number', 'Tenant', 'Square Footage', 'Link Text', 'Link Href']
df = DataFrame(table_data, columns=headers)
print(df)