
I'm trying to create a Pandas DataFrame by iterating through data in a soup (from BeautifulSoup4). This SO post suggested using the .loc method ("setting with enlargement") to build a DataFrame.

However, this method takes a long time to run (around 8 minutes for a DataFrame of 30,000 rows and 5 columns). Is there a quicker way of doing this? Here's my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://api.turfgame.com/v3/zones"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

col_names = ["name", "lat", "lng", "points_take", "points_hold"]
dfi = pd.DataFrame(columns=col_names)

def get_all_zones():

    for attr in soup.find_all("zone"):
        col_values= [attr.get("name"), attr.get("lat"), attr.get("lng"), attr.get("points_take"), attr.get("points_hold")]
        pos = len(dfi.index)
        dfi.loc[pos] = col_values

    return dfi

get_all_zones()
Jason
    Use a dictionary where keys are column names and values are the columns as lists (no Series or frames). Once you have everything, pass the dictionary to `pd.DataFrame` – behzad.nouri Sep 15 '14 at 00:51
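A minimal sketch of the commenter's suggestion. The sample rows here are hypothetical stand-ins for the parsed `<zone>` attributes; only the column names come from the question:

```python
import pandas as pd

# Hypothetical sample values standing in for the parsed zone attributes
columns = {
    "name": ["ZoneA", "ZoneB"],
    "lat": [59.3, 59.4],
    "lng": [18.0, 18.1],
    "points_take": [125, 185],
    "points_hold": [1, 2],
}

# Build the DataFrame once from plain Python lists -- no per-row .loc calls
df = pd.DataFrame(columns)
print(df.shape)  # (2, 5)
```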

1 Answer


Avoid

df.loc[pos] = row

whenever possible. Pandas NDFrames store the underlying data in blocks (of common dtype) which (for DataFrames) are associated with columns. DataFrames are column-based data structures, not row-based data structures.

To access a row, the DataFrame must access each block, pick out the appropriate row and copy the data into a new Series.

Adding a row to an existing DataFrame is also slow, since a new row must be appended to each block, and new data copied into the new row. Even worse, each data block has to be contiguous in memory, so adding a new row may force Pandas (or NumPy) to allocate a whole new array for the block, and all the data for that block has to be copied into the larger array just to accommodate that one row. All that copying makes things very slow, so avoid it if possible.
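The difference is easy to measure. A small benchmark along these lines (row count and values chosen arbitrarily for illustration):

```python
import time
import pandas as pd

n = 2000
row = ["ZoneA", 59.3, 18.0, 125, 1]
cols = ["name", "lat", "lng", "points_take", "points_hold"]

# Growing a DataFrame one row at a time via .loc (setting with enlargement)
t0 = time.perf_counter()
df_slow = pd.DataFrame(columns=cols)
for i in range(n):
    df_slow.loc[i] = row
slow = time.perf_counter() - t0

# Collecting rows in a Python list, then building the DataFrame once
t0 = time.perf_counter()
data = [row for _ in range(n)]
df_fast = pd.DataFrame(data, columns=cols)
fast = time.perf_counter() - t0

print(f".loc enlargement: {slow:.2f}s, list then DataFrame: {fast:.4f}s")
```

On a typical machine the list-then-DataFrame version is orders of magnitude faster, and the gap widens as the row count grows.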

The solution in this case is to append the data to a Python list and create the DataFrame in one fell swoop at the end:


import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://api.turfgame.com/v3/zones"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

col_names = ["name", "lat", "lng", "points_take", "points_hold"]
data = []

def get_all_zones():
    for attr in soup.find_all("zone"):
        col_values = [attr.get("name"), attr.get("lat"), attr.get("lng"),
                      attr.get("points_take"), attr.get("points_hold")]
        data.append(col_values)
    dfi = pd.DataFrame(data, columns=col_names)

    return dfi

dfi = get_all_zones()
print(dfi)
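One caveat worth noting: attribute values parsed by BeautifulSoup are strings, so the lat/lng/points columns will arrive as object dtype. Converting them with `pd.to_numeric` is a likely follow-up. A sketch with a hypothetical sample row (only the column names come from the question):

```python
import pandas as pd

# Attribute values parsed from markup arrive as strings
dfi = pd.DataFrame(
    [["ZoneA", "59.3", "18.0", "125", "1"]],
    columns=["name", "lat", "lng", "points_take", "points_hold"],
)

# Convert the numeric columns from strings to proper numeric dtypes
for col in ["lat", "lng", "points_take", "points_hold"]:
    dfi[col] = pd.to_numeric(dfi[col])

print(dfi.dtypes)
```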
unutbu