I've been working on a scraping script and trying various ways to export the extracted data to CSV. I find that csv.writer is much faster than to_csv (I didn't actually time it, but it is noticeably quicker). Unfortunately, my output using csv.writer is not what I want.
The to_csv output is exactly what I expected, but it took a few hours to export about nine thousand rows of data. So I tried csv.writer instead; it runs much faster than to_csv, but unfortunately the output is not what I expected.
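For reference, this is roughly how I plan to time the two exports properly (time.perf_counter is just my guess at a fair way to measure; finaldf and writerows come from the script below):

import time

# time a single bulk export with to_csv
t0 = time.perf_counter()
finaldf.to_csv('testing01.csv', index=False)
print('to_csv:', time.perf_counter() - t0, 'seconds')

# time the same export through my csv.writer helper
t0 = time.perf_counter()
writerows(finaldf, 'testing.csv')
print('csv.writer:', time.perf_counter() - t0, 'seconds')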
Here's my script:
import csv
import datetime

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as bs

def writerows(rows, filename):
    with open(filename, 'a', encoding='UTF8', newline='') as toWrite:
        writer = csv.writer(toWrite)
        writer.writerow(header)    # header is defined elsewhere in my script
        writer.writerows([rows])

start = datetime.datetime.now()
count = 0
allin = []
for html in results:    # results holds the scraped HTML pages
    count += 1
    soup = bs(html, 'html.parser')
    try:
        all = soup.find('td', class_='FootNote').text
    except:
        all = np.nan
    # parse each table once and reuse the frame
    name_tbl = pd.read_html(html, match='Name')[0]
    name = name_tbl.set_index([0, name_tbl.groupby(0).cumcount()])[1].unstack(0)
    com_tbl = pd.read_html(html, match='Company Name')[0]
    comname = com_tbl.set_index([0, com_tbl.groupby(0).cumcount()])[1].unstack(0)
    try:
        adm = pd.read_html(html, match='Admission Sponsor', index_col=0)[0].T
    except:
        adm = pd.DataFrame({'Admission Sponsor': np.nan, 'Sponsor': np.nan}, index=[0])
    df = name.join(comname).join(adm)
    df['remark'] = all
    allin.append(df)
    finaldf = pd.concat(allin, ignore_index=True)
    # finaldf.to_csv('testing01.csv', index=0) >>>> to_csv function
    # writerows(finaldf, 'testing.csv') >>>> csv.writer function
    print(count, '|', end=' ')

finish = datetime.datetime.now() - start
print("Time Taken:", finish)
I was expecting the output to be the same as the to_csv function's output. I'm fairly sure my script is missing something, since I can't achieve the same result as to_csv. I'd greatly appreciate any guidance and explanation.