
Can someone tell me a way to add data to a pandas DataFrame in Python when multiple threads are all calling a function that appends rows to the same DataFrame?

My code scrapes data from a URL, and I was using df.loc[index]... to add the scraped row to the DataFrame.

I've now started multiple threads, assigning one URL to each thread, so in short many pages are being scraped at once.

How do I append those rows into the dataframe?
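
For reference, this is roughly the single-threaded pattern I have been using (the URLs and column names here are just placeholders, not my real code):

import requests
import pandas as pd

df = pd.DataFrame(columns=['url', 'html'])

# single-threaded version: one row appended per URL via df.loc
for index, url in enumerate(["http://google.com", "http://yahoo.com"]):
    df.loc[index] = [url, requests.get(url).text]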

Yasir Azeem

1 Answer


Adding rows to a DataFrame one at a time is not recommended. I suggest you build up your data in plain Python lists (or dicts), combine them at the end, and call the DataFrame constructor only once, on the full data set.

Example:

# help from http://stackoverflow.com/a/28463266/3393459
# and http://stackoverflow.com/a/2846697/3393459


from multiprocessing.dummy import Pool as ThreadPool 
import requests
import pandas as pd


pool = ThreadPool(4)  # 4 worker threads

# called by each thread
def get_web_data(url):
    return {'col1': 'something', 'request_data': requests.get(url).text}


urls = ["http://google.com", "http://yahoo.com"]
results = pool.map(get_web_data, urls)


print(results)                 # list of dicts, one per URL
print(pd.DataFrame(results))   # single DataFrame built from the full result set
exp1orer
  • Thank you.. That's an idea for sure. How do I manage a workaround to index each list? Since any thread can generate any list at any time, giving an index to start with and then increasing it one by one may not be the right choice... – Yasir Azeem Dec 02 '16 at 18:55
  • 1
    Not sure what you mean. I posted example code so we can talk more concretely. When multiprocessing my understanding is you can't have any guarantees about the order in which results come back... If you want to post your code that might also be helpful. – exp1orer Dec 02 '16 at 19:09
  • 1
    I just took your list advice and just appended all the data into a list and then finally transferred it to pandas dataframe and it worked perfectly for my case! Thanks a lot :) – Yasir Azeem Dec 02 '16 at 19:26
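
For completeness, a minimal sketch of the approach described in that last comment (the URLs and column names are illustrative, not from the original post); in CPython, list.append is thread-safe, so no extra locking should be needed here:

from multiprocessing.dummy import Pool as ThreadPool
import requests
import pandas as pd

rows = []

def scrape(url):
    # each thread appends its scraped row to a shared list
    rows.append({'url': url, 'html': requests.get(url).text})

pool = ThreadPool(4)
pool.map(scrape, ["http://google.com", "http://yahoo.com"])
pool.close()
pool.join()

# build the DataFrame once, after all threads have finished
df = pd.DataFrame(rows)
print(df)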