I have a class that formats a CSV using pandas DataFrames. The CSV files are very large, so I would like to introduce multithreading to speed up processing. The issue I am running into is that the DataFrame's values don't get altered when I use multithreading. Any tips?

Method that adds data to the dataframe and tracks progress:

    def add_data(self, x, whitelist, time_zone, progress_bar, info_panel):
        y, mo, d, h, mi, s = self.parseDateTime(x['date']) 
        
        date = (dt.datetime(y, mo, d, h, mi) + dt.timedelta(hours=self.time_zones[self.time_zone]))
        
        if date >= self.DST_start and date < self.DST_end:
            date += self.DST_diff
            
        date = date.strftime("%m/%d/%Y %I:%M %p")
        
        key = x['keys']
        val = x['val']
        
        self.current_entry += 1
        progress_bar['value'] = (self.current_entry/self.total_entries) *100
        
        info_panel.update_idletasks()
        info_panel.update()

        if (val != 'NaN'):
            if(key in whitelist):
                try:
                    temp = float(val)
                    if self.dfOut.isna()[key][date]:
                        self.dfOut[key][date] = temp
                    else:
                        self.dfOut[key][date] += temp
                    self.dfAvg[key][date] += 1
                except ValueError:
                    self.dfOut[key][date] = val

Method that is being passed to the thread:

    def thread_add_data(self, df, progress_bar, info_panel):
        df.apply(lambda x: self.add_data(x, self.whitelist, self.time_zones[self.time_zone], progress_bar, info_panel, self.dfOut), axis=1)

Threading being instantiated:

            t1 = threading.Thread(target=self.thread_add_data, args=(dfSplit[0], progress_bar, info_panel))
            t2 = threading.Thread(target=self.thread_add_data, args=(dfSplit[1], progress_bar, info_panel))
            t3 = threading.Thread(target=self.thread_add_data, args=(dfSplit[2], progress_bar, info_panel))
            t4 = threading.Thread(target=self.thread_add_data, args=(dfSplit[3], progress_bar, info_panel))

            t1.start()
            print("1...")
            t2.start()
            print("2...")
            t3.start()
            print("3...")
            t4.start()
            print("4...")
            
            t1.join(timeout=0.05)
            t2.join(timeout=0.05)
            t3.join(timeout=0.05)
            t4.join(timeout=0.05)
            print("...Complete")
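
One thing worth checking in the snippet above: `Thread.join(timeout=...)` returns as soon as the timeout expires, whether or not the thread has finished, so "...Complete" can print while the workers are still running. A minimal standalone sketch of that behavior (the `worker` function and `results` list are hypothetical stand-ins, not part of the class above):

```python
import threading
import time


def worker(results, i):
    # Hypothetical stand-in for a slow task that records a result when done.
    time.sleep(0.2)
    results.append(i)


results = []
t = threading.Thread(target=worker, args=(results, 0))
t.start()

# join() with a short timeout gives up waiting; the thread keeps running.
t.join(timeout=0.01)
still_running = t.is_alive()
seen_after_timeout = list(results)

# join() without a timeout blocks until the thread has actually finished.
t.join()
seen_after_full_join = list(results)
```

Checking `is_alive()` after a timed join is the usual way to tell whether the thread really finished.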
  • If all the threads are trying to modify the same dataframe, you need to use locking around the part that does this. – Barmar Jul 20 '21 at 20:06
  • What if they are editing different parts of the dataframe? – Zach Plocher Jul 20 '21 at 20:11
  • See https://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe – Barmar Jul 20 '21 at 20:15
  • Ah, thanks for the help. I'm not terribly familiar with locking, but I would only need to lock and unlock when I am adding to the df, and not when formatting the data, correct? – Zach Plocher Jul 20 '21 at 20:20
  • Right. You only have to lock when accessing data that's shared between threads. Anything local to the thread is automatically safe. – Barmar Jul 20 '21 at 20:28
  • Threading won't speed up the code that you've posted (see https://stackoverflow.com/questions/3044580/multiprocessing-vs-threading-python). Consider using `concurrent.futures` or `joblib`, and don't try to share a single dataframe: let each worker have its own copy and combine them at a later time, when all workers are done. – Dimitry Jul 20 '21 at 20:40
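
A minimal sketch of the approach suggested in the last comment, assuming the input has the `date`, `keys`, and `val` columns used in `add_data` and that `val` is already numeric. The names `summarize_chunk` and `summarize_parallel` are hypothetical, and the date shifting, whitelist, and progress bar are left out. Each worker aggregates only its own slice, and the partial results are combined once all workers are done, so no lock is needed:

```python
import math
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def summarize_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Aggregate one chunk on its own; no shared state is touched."""
    # Sum and count of 'val' per (date, keys) pair within this chunk only.
    return chunk.groupby(["date", "keys"])["val"].agg(["sum", "count"])


def summarize_parallel(df: pd.DataFrame, n_workers: int = 4) -> pd.DataFrame:
    """Split df into row slices, aggregate each separately, then combine.

    Assumes df is non-empty.
    """
    size = max(1, math.ceil(len(df) / n_workers))
    chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(summarize_chunk, chunks))

    # The same (date, keys) pair can appear in several chunks, so the
    # partial sums and counts are added together in a final pass.
    return pd.concat(partials).groupby(level=["date", "keys"]).sum()
```

`ProcessPoolExecutor` exposes the same `map` interface and may be the better fit for CPU-bound work, at the cost of pickling each chunk; whether either beats a single `groupby` on the full frame depends on the data.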
