I am working on a web scraping project and have run into the following error:

requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

Below is my code. I retrieve all of the links from the HTML table and they print out as expected, but when I try to loop through them (`links`) with `requests.get` I get the error above.

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'html.parser')
        table = []
        # Find all the divs we need in one go.
        divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
        for div in divs:
            # find all the enclosing a tags.
            anchors = div.find_all('a')
            for anchor in anchors:
                # Now we have groups of 3 list items (li) tags
                lis = anchor.find_all('li')
                # we clean up the text from the group of 3 li tags and add them as a list to our table list.
                table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
        # We have all the data so we add it to a DataFrame.
        headers = ['Number', 'Tenant', 'Square Footage']
        df = DataFrame(table, columns=headers)
        print (df)
  • Always put the full error message (traceback) in the question, as text, not a screenshot. It contains other useful information; for example, it shows which line causes the problem. – furas Dec 20 '17 at 03:32
  • Your mistake is the double `for` loop - use `print` to display the values in your variables and you will see the mistake you made. – furas Dec 20 '17 at 03:34
  • As per my understanding there isn't any error; they are just not getting what they want, to be exact, right? – P.hunter Dec 20 '17 at 03:34
  • @P.hunter The question indicates `requests.exceptions.MissingSchema`. – Galen Dec 20 '17 at 03:35
  • Yeah, I got it, thanks. – P.hunter Dec 20 '17 at 03:36
  • BTW: pandas can read tables directly from web pages: `all_tables = pandas.read_html(url); df = all_tables[0]` – furas Dec 20 '17 at 03:53
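
A minimal sketch of the `pandas.read_html` approach furas mentions above, assuming the listing table is the first `<table>` on the page:

import pandas as pd

url = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"

# read_html parses every <table> on the page into a list of DataFrames
all_tables = pd.read_html(url)
df = all_tables[0]  # assumption: the first table is the one of interest
print(df.head())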

1 Answer


Your mistake is the second `for` loop in the code:

for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:

`ref['href']` gives you a single URL, but the next `for` loop treats it as a list.

So effectively you have

for link in ref['href']:

and iterating over a string yields its characters, so the first thing it gives you is the first character of the URL http://properties.kimcore..., which is h.
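
A minimal sketch of that behavior:

url = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"

# iterating over a string yields one character per pass,
# so requests.get() is called with just 'h' on the first pass
for link in url:
    print(link)  # 'h', then 't', then 't', ...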

Full working code

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    link = ref['href']
    print(link)
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = []
    # Find all the divs we need in one go.
    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        # find all the enclosing a tags.
        anchors = div.find_all('a')
        for anchor in anchors:
            # Now we have groups of 3 list items (li) tags
            lis = anchor.find_all('li')
            # we clean up the text from the group of 3 li tags and add them as a list to our table list.
            table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
    # We have all the data so we add it to a DataFrame.
    headers = ['Number', 'Tenant', 'Square Footage']
    df = DataFrame(table, columns=headers)
    print(df)
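
A side note on the `unicodedata.normalize("NFKD", ...)` call in the code above: scraped text often contains non-breaking spaces (`\xa0`), and NFKD normalization turns them into plain spaces so `.strip()` works as expected. A minimal sketch with a made-up string:

import unicodedata

raw = "1,200\xa0sq ft"  # hypothetical scraped text containing a non-breaking space
clean = unicodedata.normalize("NFKD", raw).strip()
print(clean)  # '1,200 sq ft' - the \xa0 becomes a plain space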

BTW: if you add a trailing comma, as in `(ref['href'], )`, then you get a tuple, and the second `for` loop works correctly.
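
A minimal sketch of the difference, using a plain string in place of `ref['href']`:

url = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"

links = (url)    # no comma: just the string in parentheses
links = (url,)   # trailing comma: a one-element tuple

for link in links:
    print(link)  # prints the whole URL once, not one character at a time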


EDIT: the version below creates the list `table_data` at the start, adds all the data to this list, and converts it into a DataFrame at the end.

But now I see it reads the same page a few times, because the same URL appears in every column of a row. You have to get the URL from only one column.

EDIT: now it doesn't read the same URL many times.

EDIT: now it gets the text and `href` from the first link in each row and adds them to every element appended to the list.

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table_data = []

# all rows in the table except the first one ([1:]) - the header row
rows = soup.select('table tr')[1:]
for row in rows:

    # link in the first column (td[0])
    #link = row.select('td')[0].find('a')
    link = row.find('a')

    link_href = link['href']
    link_text = link.text

    print('text:', link_text)
    print('href:', link_href)

    page = requests.get(link_href)
    soup = BeautifulSoup(page.content, 'html.parser')

    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        anchors = div.find_all('a')
        for anchor in anchors:
            lis = anchor.find_all('li')
            item1 = unicodedata.normalize("NFKD", lis[0].text).strip()
            item2 = lis[1].text
            item3 = lis[2].text.strip()
            table_data.append([item1, item2, item3, link_text, link_href])

    print('table_data size:', len(table_data))            

headers = ['Number', 'Tenant', 'Square Footage', 'Link Text', 'Link Href']
df = DataFrame(table_data, columns=headers)
print(df)