
I'm scraping property ads with BS4 and analyzing them with pandas. Using multiprocessing, I have the following:

import re
from multiprocessing import Pool

import pandas as pd

def show_ad_prices(options):
    pool = Pool(options)

    page_link_list=[] # list of urls of pages with ads
    BS4_main(page_root_url) # BS4_main requests and parses url
    last_page_number=int(container.findAll("a", href=re.compile('^('+house_href+')((?!:).)*$'))[-2].text)
    for i in range(1,last_page_number):
       page_nr=page_root_url+'pagina-'+str(i)+'.htm'    
       page_link_list.append(page_nr)

    for page_link_url in page_link_list:
       overall_df=pd.DataFrame() 
       print(page_link_url)
       ad_page_urls = get_ad_page_urls(page_link_url) # returns all urls of ads on one page

       try:
           results = pool.map(get_ad_data, ad_page_urls) # gets data from ad 
       except Exception:
           print('error: '+page_link_url)
           continue

       try:
           df=pd.DataFrame.from_dict(results) # make DataFrame of data of all ads of one page
           print(df)
           overall_df.append(df) # append DataFrame to overall DataFrame
           print(overall_df)
       except Exception: 
           print('error: '+page_link_url)

    return overall_df

My code successfully creates a dataframe of all ads on one page. print(df) prints such a "one-page" dataframe. However, when I try to append a one-page dataframe to the empty overall dataframe, nothing happens. The overall dataframe stays empty.

I've tried the answers to this question, but they don't seem to work. The code should create a one-page DataFrame and then append it to the overall DataFrame.

Cœur
LucSpan
  • I believe the culprit is `for page_link_url in page_link_list: overall_df=pd.DataFrame()`. `overall_df` is reset to an empty dataframe on every iteration. Try moving `overall_df=pd.DataFrame()` outside the loop. Also keep in mind that your issue might be that every process has its own `overall_df` variable in memory. You'll need to post the rest of the code if that is the case. – DeepSpace Mar 24 '17 at 16:11
  • Also, `append` is not an in-place operation for DataFrames, so you'll need to reassign, i.e. `overall_df = overall_df.append(df)`. – root Mar 24 '17 at 16:14
  • `.append()` does not happen in-place. You need to reassign to the data frame variable as @root indicated – James Mar 24 '17 at 16:16
  • Thanks so much guys, it's working! – LucSpan Mar 24 '17 at 16:25
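
For anyone landing here later, the two fixes from the comments (initialize once outside the loop, and don't rely on an in-place `append`) can be sketched roughly as below. This is only a minimal illustration, not the asker's actual code: `collect_pages` and `fake_pages` are hypothetical names, and the hard-coded records stand in for whatever `pool.map(get_ad_data, ad_page_urls)` returns per page. It collects the one-page frames in a plain list and combines them with `pd.concat`, which also works on pandas 2.0 and later, where `DataFrame.append` has been removed.

```python
import pandas as pd


def collect_pages(pages):
    """Combine per-page lists of ad records into one DataFrame.

    `pages` is assumed to be an iterable where each item is a list of
    dicts, one dict per ad (hypothetical shape, for illustration only).
    """
    frames = []  # created once, before the loop, so it is never reset
    for records in pages:
        df = pd.DataFrame.from_records(records)  # one-page DataFrame
        frames.append(df)  # list.append mutates in place, unlike DataFrame.append
    if not frames:
        return pd.DataFrame()  # pd.concat raises on an empty list
    return pd.concat(frames, ignore_index=True)


# Made-up records standing in for the scraped results of two pages.
fake_pages = [
    [{"price": 250000, "rooms": 3}, {"price": 310000, "rooms": 4}],
    [{"price": 199000, "rooms": 2}],
]

overall_df = collect_pages(fake_pages)
print(overall_df)
```

Building a list and concatenating once at the end also avoids copying the growing frame on every iteration, which is what a repeated `overall_df = overall_df.append(df)` would do.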
