I am scraping England's Joint Data and get the results in the format I want when I process one hospital at a time. I eventually want to iterate over all hospitals, but first I decided to make a list of three different hospitals and figure out the iteration.

The code below gives me the correct format of the final results in a pandas DataFrame when I have just one hospital:

import requests
from bs4 import BeautifulSoup
import pandas
import numpy as np
r=requests.get("http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital")
c=r.content
soup=BeautifulSoup(c,"html.parser")

all=soup.find_all(["div"],{"class":"toggle_container"})[1]

i=0
temp = []
for item in all.find_all("td"):
    if i%4 ==0:
        temp.append(soup.find_all("span")[4].text)
        temp.append(soup.find_all("h5")[0].text)
    temp.append(all.find_all("td")[i].text.replace("   ",""))
    i=i+1
table = np.array(temp).reshape(12,6)
final = pandas.DataFrame(table)
final

In my iterated version, I cannot figure out a way to append each result set into a final DataFrame:

hosplist = ["http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital",
            "http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Barnet%20Hospital",
            "http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Altnagelvin%20Area%20Hospital"]
temp2 = []
df_final = pandas.DataFrame()
for item in hosplist:
    r=requests.get(item)
    c=r.content
    soup=BeautifulSoup(c,"html.parser")

    all=soup.find_all(["div"],{"class":"toggle_container"})[1]
    i=0
    temp = []
    for item in all.find_all("td"):
        if i%4 ==0:
            temp.append(soup.find_all("span")[4].text)
            temp.append(soup.find_all("h5")[0].text)
        temp.append(all.find_all("td")[i].text)
        i=i+1
    table = np.array(temp).reshape((int(len(temp)/6)),6)
    temp2.append(table)
    #df_final = pandas.DataFrame(df)

At the end, 'table' has all the data I want, but it's not easy to manipulate, so I want to put it in a DataFrame. However, I am getting a "ValueError: Must pass 2-d input" error.

I think this error is saying that I have 3 arrays, which would make the input 3-dimensional. This is just a practice iteration; there are over 400 hospitals whose data I plan to put into a DataFrame, but I am stuck here for now.
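To make the dimensionality point concrete, here is a minimal sketch that uses made-up stand-in arrays instead of scraped data. A list of several 2-D tables is effectively 3-D, which is most likely what the error is complaining about. Stacking the tables row-wise, or concatenating one DataFrame per hospital, gives a single 2-D result. The names `table_a`, `table_b` and `tables` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Stand-ins for the per-hospital tables built in the loop (made-up values).
# Each one is 2-D with 6 columns, but the row counts may differ.
table_a = np.array([["A", "x", "1", "2", "3", "4"],
                    ["A", "y", "5", "6", "7", "8"]])
table_b = np.array([["B", "x", "9", "8", "7", "6"]])

tables = [table_a, table_b]

# Option 1: stack the 2-D arrays row-wise into one 2-D array.
stacked = np.vstack(tables)
df_stacked = pd.DataFrame(stacked)

# Option 2: build one DataFrame per hospital and concatenate them.
df_concat = pd.concat([pd.DataFrame(t) for t in tables], ignore_index=True)

print(df_stacked.shape)
```

Either option should scale to the full list of hospitals, since the only requirement is that every table has the same number of columns.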

CandleWax

2 Answers

The simple answer to your question would be HERE.

The tough part was taking your code and finding what was still wrong.

Using your full code, I modified it as shown below. Please copy and diff with yours.

import requests
from bs4 import BeautifulSoup
import pandas
import numpy as np

hosplist = ["http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital",
            "http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Barnet%20Hospital",
            "http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Altnagelvin%20Area%20Hospital"]
temp2 = []
df_final = pandas.DataFrame()
for item in hosplist:
    r=requests.get(item)
    c=r.content
    soup=BeautifulSoup(c,"html.parser")

    all=soup.find_all(["div"],{"class":"toggle_container"})[1]
    i=0
    temp = []
    for item in all.find_all("td"):
        if i%4 ==0:
            temp.append(soup.find_all("span")[4].text)
            temp.append(soup.find_all("h5")[0].text)
        temp.append(all.find_all("td")[i].text)
        i=i+1
    table = np.array(temp).reshape((int(len(temp)/6)),6)
    for array in table:
        newArray = []
        for x in array:
            try:
                x = x.encode("ascii")
            except:
                x = 'cannot convert'
            newArray.append(x)
        temp2.append(newArray)

df_final = pandas.DataFrame(temp2, columns=['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print(df_final)

I tried to use a list comprehension for the ASCII conversion, which was absolutely necessary for the strings to show up in the DataFrame, but the comprehension was throwing an error, so I used a try/except inside an explicit loop instead. The except branch never actually triggers.
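For comparison, here is a rough sketch of how the row-building step could be written with list comprehensions. It assumes Python 3, where the text returned by BeautifulSoup is already `str`, so no ASCII encoding step is needed. The helper names `clean` and `to_rows` are hypothetical, and the cell values are made up.

```python
def clean(cell):
    """Collapse runs of whitespace and trim the ends of one cell's text."""
    return " ".join(cell.split())


def to_rows(flat, width=6):
    """Split a flat list of cells into rows of `width` cleaned cells each."""
    if len(flat) % width != 0:
        raise ValueError("cell count is not a multiple of the row width")
    return [[clean(c) for c in flat[i:i + width]]
            for i in range(0, len(flat), width)]


# Made-up cells standing in for one hospital's scraped values.
flat = ["Hosp A", "Hips", " 10 ", "20", "30", "40",
        "Hosp A", "Knees", "1", "2\n", "3", "4"]

all_rows = []
all_rows.extend(to_rows(flat))  # call once per hospital inside the loop
print(all_rows[0])
```

The resulting list of lists can be passed to `pandas.DataFrame` with a `columns` argument, in the same way `temp2` is used above.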

Thom Ives

I reorganized the code a little and was able to create the dataframe without having to encode.

Solution:

hosplist = ["http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital",
            "http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Barnet%20Hospital",
            "http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Altnagelvin%20Area%20Hospital"]
temp = []
temp2 = []
df_final = pandas.DataFrame()
for item in hosplist:
    r=requests.get(item)
    c=r.content
    soup=BeautifulSoup(c,"html.parser")

    all=soup.find_all(["div"],{"class":"toggle_container"})[1]
    i=0

    for item in all.find_all("td"):
        if i%4 ==0:
            temp.append(soup.find_all("span")[4].text)
            temp.append(soup.find_all("h5")[0].text)
        temp.append(all.find_all("td")[i].text.replace("-","NaN").replace("+",""))
        i=i+1
temp2.append(temp)
table = np.array(temp2).reshape((int(len(temp2[0])/6)),6)
df_final = pandas.DataFrame(table, columns=['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
df_final
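This seems to work because `temp` is created once, before the loop, so the cells from every hospital accumulate in a single flat list that is reshaped only once at the end. Below is a small sketch of that final step with made-up values. It uses `-1` so that NumPy infers the row count, which is an alternative to computing `len(...)/6` by hand.

```python
import numpy as np

# Made-up flat list standing in for `temp` after the loop has visited
# every hospital: three rows' worth of cells, six cells per row.
flat = ["A", "Hips", "1", "2", "3", "4",
        "A", "Knees", "5", "6", "7", "8",
        "B", "Hips", "9", "8", "7", "6"]

# -1 asks NumPy to infer the number of rows from the total length.
table = np.array(flat).reshape(-1, 6)
print(table.shape)  # (3, 6)
```

If the total number of cells is not a multiple of 6, the reshape raises a `ValueError`, which is a useful early signal that one of the pages has an unexpected layout.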
CandleWax