I am trying to create a dataframe from three lists which I have generated using webscraped data. However, when I put these lists into a dictionary and use it to build my pandas dataframe, I get a separate dataframe for each item (row) rather than one dataframe containing all of the items as rows.

I believe the issue lies in the for loop that I have used to webscrape the data. I know similar questions have been asked, including "Pandas DataFrame created for each row" and "Take multiple lists into dataframe", but I have tried the solutions without any joy. I believe the webscrape loop adds a nuance that makes this more tricky.

A step-by-step walkthrough of my code and its output is below. For reference, I have imported pandas as pd and bs4.

# Step 1 create a webscraper which takes three sets of data (price, bedrooms and bathrooms) from a website and populates three separate lists

for container in containers:
    try:
        price_container=container.find("a",{"class":"listing-price text-price"})
        price_strip=price_container.text.strip()
        price_list=[]
        price_list.append(price_strip)

    except TypeError:
        continue

    try:
        bedroom_container = container.find("span",{"class":"icon num-beds"})
        bedroom_strip=(bedroom_container["title"])
        bedroom_list=[]
        bedroom_list.append(bedroom_strip)
    
    except TypeError:
        continue

    try:
        bathroom_container=container.find("span", {"class":"icon num-baths"})
        bathroom_strip=(bathroom_container["title"])
        bathroom_list=[]
        bathroom_list.append(bathroom_strip)
    
    except TypeError:
        continue

# Step 2 create a dictionary 

    data = {'price':price_list, 'bedrooms':bedroom_list, 'bathrooms':bathroom_list}


# Step 3 turn it into a pandas dataframe and print the output

    d=pd.DataFrame(data)
    print(d)    

This gives me a dataframe for each dictionary as below.

   price               bedrooms          bathrooms                                   
0  £200,000            3                 2

[1 rows x 3 columns]
  

   price               bedrooms          bathrooms                                   
0  £400,000            5                 3

[1 rows x 3 columns]


   price               bedrooms          bathrooms
0  £900,000            6                 4

[1 rows x 3 columns]

and so on.....

I've tried dictionary comprehension and list comprehension, to give me one dataframe rather than a dataframe for each dictionary item:

data = [({'price':price, 'bedrooms':bedrooms, 'bathrooms':bathrooms}) for item in container]

df = pd.DataFrame(data)

print(df)

and, however I write the list comprehension, this yields an even weirder output. It gives me a dataframe for each item in the dictionary, with the same row of information repeated a number of times:

   price               bedrooms          bathrooms                                  
0  £200,000            3                 2
0  £200,000            3                 2
0  £200,000            3                 2

[3 rows x 3 columns]
  

   price               bedrooms          bathrooms                                   
0  £400,000            5                 3
0  £400,000            5                 3
0  £400,000            5                 3

[3 rows x 3 columns]


   price               bedrooms          bathrooms                                   
0  £900,000            6                 4
0  £900,000            6                 4
0  £900,000            6                 4

[3 rows x 3 columns]

and so on...

How do I resolve this problem and get all of my data into one pandas dataframe?

goodaytar

3 Answers


Firstly, you should create price_list=[], bedroom_list=[] and bathroom_list=[] before your for loop. Otherwise each list is reset to [] on every iteration and then has a single element appended, so it never holds more than one element.

Secondly, if you want a single dataframe, you should create it outside the for loop, i.e. dedent the data = {...} line and the lines that follow it.

Finally, you should mark missing data explicitly. If any continue other than the first is executed, price_list, bedroom_list and bathroom_list will end up with different lengths. I suggest replacing the first continue with price_list.append(None), the second with bedroom_list.append(None) and the third with bathroom_list.append(None), so that the dataframe clearly shows where data is missing.
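
The three changes can be sketched roughly as follows. This is only an illustration under assumptions: `containers` is a hypothetical stand-in made of plain dicts (a missing key plays the role of a missing tag), so the sketch runs without BeautifulSoup or a network request, and `KeyError` takes the place of the exception caught in the real scraping code.

```python
import pandas as pd

# Hypothetical stand-in for the scraped containers (assumption, not real data).
# A missing key plays the role of a tag that the scraper could not find.
containers = [
    {"price": "£200,000", "bedrooms": "3", "bathrooms": "2"},
    {"price": "£400,000", "bedrooms": "5"},  # bathrooms missing
    {"price": "£900,000", "bedrooms": "6", "bathrooms": "4"},
]

# 1. Create the lists once, before the loop, so they accumulate values.
price_list, bedroom_list, bathroom_list = [], [], []

for container in containers:
    # 3. Append None instead of using `continue`, so every list grows by
    #    exactly one element per container and the lengths stay equal.
    try:
        price_list.append(container["price"])
    except KeyError:
        price_list.append(None)

    try:
        bedroom_list.append(container["bedrooms"])
    except KeyError:
        bedroom_list.append(None)

    try:
        bathroom_list.append(container["bathrooms"])
    except KeyError:
        bathroom_list.append(None)

# 2. Build the dictionary and the dataframe once, after the loop.
data = {"price": price_list, "bedrooms": bedroom_list, "bathrooms": bathroom_list}
df = pd.DataFrame(data)
print(df)
```

With real BeautifulSoup tags, `find` returns `None` when nothing matches, so `bedroom_container["title"]` raises `TypeError` but `price_container.text` raises `AttributeError`. The price branch would probably need to catch that exception as well.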

Daweo

The part of the code you're testing here is fine: a dictionary of lists will always return a single dataframe. So this part:

pd.DataFrame(data)

can't be the cause of the problem. Instead, the problem is that it sits inside the loop, so it runs once per container. The same goes for your lists, which are redefined on every iteration.

Take those parts out of the loop, and you should be ok.
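
A small comparison of the two placements, using made-up values in place of the scraped ones (the `scraped` tuples and the `frames_inside` name are assumptions for illustration only):

```python
import pandas as pd

# Made-up sample rows standing in for the scraped values (assumption).
scraped = [("£200,000", "3", "2"), ("£400,000", "5", "3"), ("£900,000", "6", "4")]

# Lists and dataframe created inside the loop: one single-row frame per pass.
frames_inside = []
for price, beds, baths in scraped:
    data = {"price": [price], "bedrooms": [beds], "bathrooms": [baths]}
    frames_inside.append(pd.DataFrame(data))

# Lists created before the loop, dataframe created once after it.
price_list, bedroom_list, bathroom_list = [], [], []
for price, beds, baths in scraped:
    price_list.append(price)
    bedroom_list.append(beds)
    bathroom_list.append(baths)

df = pd.DataFrame(
    {"price": price_list, "bedrooms": bedroom_list, "bathrooms": bathroom_list}
)

print(len(frames_inside), [frame.shape for frame in frames_inside])
print(df.shape)
```

The first version produces as many one-row dataframes as there are containers, while the second produces one dataframe with a row per container.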

houseofleft

You have to merge the three lists:

df = pd.DataFrame(data["price"] + data["bedrooms"] + data["bathrooms"] )

If you want something more generic:

list_ = [item for i in data for item in data[i]]
df = pd.DataFrame(list_)
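
To check what these expressions produce, here is a quick sketch with made-up values (the sample `data` is an assumption, not the asker's scraped data). As far as I can tell, concatenating the lists gives a single column holding all the values one after another, whereas passing the dictionary itself gives one column per key:

```python
import pandas as pd

# Made-up sample values (assumption), shaped like the asker's dictionary.
data = {
    "price": ["£200,000", "£400,000", "£900,000"],
    "bedrooms": ["3", "5", "6"],
    "bathrooms": ["2", "3", "4"],
}

# Explicit concatenation of the three lists.
df_merged = pd.DataFrame(data["price"] + data["bedrooms"] + data["bathrooms"])

# Generic version: flatten every list in the dictionary into one list.
list_ = [item for i in data for item in data[i]]
df_generic = pd.DataFrame(list_)

# For comparison: the dictionary of lists passed directly.
df_columns = pd.DataFrame(data)

print(df_merged.shape, df_generic.shape, df_columns.shape)
# (9, 1) (9, 1) (3, 3)
```

So the merged form is one dataframe, but its shape differs from the three-column table shown in the question.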
AlexisG