I have a class that formats a CSV file using pandas DataFrames. The CSV files are very large, so I would like to introduce multithreading to speed up the processing. The problem I am running into is that the DataFrame's values are not altered when I use multithreading. Any tips?
Method that adds data to the dataframe and tracks progress:

    def add_data(self, x, whitelist, time_zone, progress_bar, info_panel):
        y, mo, d, h, mi, s = self.parseDateTime(x['date'])
        date = (dt.datetime(y, mo, d, h, mi) + dt.timedelta(hours=self.time_zones[self.time_zone]))
        if date >= self.DST_start and date < self.DST_end:
            date += self.DST_diff
        date = date.strftime("%m/%d/%Y %I:%M %p")
        key = x['keys']
        val = x['val']
        self.current_entry += 1
        progress_bar['value'] = (self.current_entry / self.total_entries) * 100
        info_panel.update_idletasks()
        info_panel.update()
        if val != 'NaN':
            if key in whitelist:
                try:
                    temp = float(val)
                    if self.dfOut.isna()[key][date]:
                        self.dfOut[key][date] = temp
                    else:
                        self.dfOut[key][date] += temp
                    self.dfAvg[key][date] += 1
                except ValueError:
                    self.dfOut[key][date] = val

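One thing that may be worth ruling out independently of the threading: `self.dfOut[key][date] = temp` is chained indexing, which the pandas documentation warns can assign into a temporary copy instead of the original frame, and which does not update the original at all when Copy-on-Write is enabled. Below is a minimal sketch of the same accumulate-and-count step using a single `.at[row, column]` access. The frame layout (formatted dates as the index, keys as columns) and the names `df_out`, `df_avg`, and `add_value` are assumptions for illustration, not the real class.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for self.dfOut and self.dfAvg:
# rows are formatted dates, columns are whitelisted keys.
dates = ["01/01/2020 01:00 AM", "01/01/2020 02:00 AM"]
df_out = pd.DataFrame(np.nan, index=dates, columns=["temp", "hum"])
df_avg = pd.DataFrame(0, index=dates, columns=["temp", "hum"])


def add_value(key, date, val):
    """Accumulate a numeric value into df_out and count it in df_avg."""
    temp = float(val)
    # .at addresses one cell of the original frame in a single step,
    # so the write cannot land on an intermediate copy.
    if pd.isna(df_out.at[date, key]):
        df_out.at[date, key] = temp
    else:
        df_out.at[date, key] += temp
    df_avg.at[date, key] += 1


add_value("temp", dates[0], "1.5")
add_value("temp", dates[0], "2.5")
print(df_out)
print(df_avg)
```

If the values still do not change with a single-step accessor like this, the cause is more likely in how the threads are run than in the assignment itself.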
Method that is being passed to the thread:

    def thread_add_data(self, df, progress_bar, info_panel):
        df.apply(lambda x: self.add_data(x, self.whitelist, self.time_zones[self.time_zone], progress_bar, info_panel, self.dfOut), axis=1)

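As posted, the lambda passes six arguments to `add_data` (the extra one is `self.dfOut`), while the `add_data` signature shown earlier accepts five besides `self`. If the real code matches these snippets, each worker would raise a `TypeError` on its first row. An uncaught exception in a worker thread does not propagate to the main thread; it is only reported through `threading.excepthook` (by default, a traceback on stderr), so the program carries on with the DataFrame unchanged. A minimal sketch of that behaviour, where `add_data`, `worker`, and `record_exception` are simplified hypothetical stand-ins:

```python
import threading

errors = []  # uncaught exceptions collected from worker threads


def record_exception(args):
    # threading calls this hook for any uncaught exception in Thread.run().
    errors.append(args.exc_value)


def add_data(x, whitelist, time_zone):
    return x


def worker(rows):
    # Passes one positional argument more than add_data accepts.
    for row in rows:
        add_data(row, [], 0, "extra")


original_hook = threading.excepthook
threading.excepthook = record_exception
try:
    t = threading.Thread(target=worker, args=([1, 2, 3],))
    t.start()
    t.join()  # returns normally even though the worker failed
finally:
    threading.excepthook = original_hook

print(type(errors[0]).__name__, errors[0])
```

Checking the console output for tracebacks from the worker threads would confirm or rule this out.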
Threading being instantiated:

    t1 = threading.Thread(target=self.thread_add_data, args=(dfSplit[0], progress_bar, info_panel))
    t2 = threading.Thread(target=self.thread_add_data, args=(dfSplit[1], progress_bar, info_panel))
    t3 = threading.Thread(target=self.thread_add_data, args=(dfSplit[2], progress_bar, info_panel))
    t4 = threading.Thread(target=self.thread_add_data, args=(dfSplit[3], progress_bar, info_panel))
    t1.start()
    print("1...")
    t2.start()
    print("2...")
    t3.start()
    print("3...")
    t4.start()
    print("4...")
    t1.join(timeout=0.05)
    t2.join(timeout=0.05)
    t3.join(timeout=0.05)
    t4.join(timeout=0.05)
    print("...Complete")
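
Note that `join(timeout=0.05)` returns after at most 0.05 seconds whether or not the thread has finished, so any code that reads the DataFrame right after these joins may run before the workers have written anything. A minimal sketch of the split, start, and join pattern with a full wait and a lock around the shared writes follows. The column names `keys` and `val` come from the question; the sample data, the `totals` Series standing in for `self.dfOut`, and `add_chunk` are assumptions for illustration.

```python
import threading

import numpy as np
import pandas as pd

# Hypothetical input: one row per reading.
df = pd.DataFrame({"keys": ["a", "b"] * 50, "val": np.arange(100, dtype=float)})
totals = pd.Series(0.0, index=["a", "b"])  # shared result, stands in for self.dfOut
lock = threading.Lock()


def add_chunk(chunk):
    # Aggregate the chunk locally first, then merge under the lock so two
    # threads never read-modify-write the shared object at the same time.
    partial = chunk.groupby("keys")["val"].sum()
    with lock:
        for key, value in partial.items():
            totals.at[key] += value


chunks = [df.iloc[i::4] for i in range(4)]  # four interleaved slices
threads = [threading.Thread(target=add_chunk, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()  # no timeout: block until this worker has actually finished

print(totals)
```

In standard CPython builds, threads running pure-Python, CPU-bound code take turns holding the GIL, so this pattern on its own may not make row-by-row work any faster. Tkinter widgets are also generally expected to be updated from the main thread only, which is a reason to keep the progress bar updates out of the workers.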