I've been working on a scraping script, trying various ways to export the extracted data to CSV. I find that csv.writer is much faster than to_csv; I didn't actually time it, but it is noticeably faster. Unfortunately, my output using csv.writer is not what I want.

This is the output of the to_csv function:

[screenshot: to_csv output]

The to_csv output is what I expected, but it took a few hours to export nine thousand rows of data. So I tried csv.writer instead; it runs faster than the to_csv function, but unfortunately the output is not what I expected.
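For reference, here is a minimal, self-contained sketch of how the two approaches are typically called (the sample data below is made up, just to illustrate the difference in interfaces, not my scraped dataset):

import csv
import pandas as pd

# hypothetical sample data, not the scraped dataset
header = ['Name', 'Company Name', 'remark']
rows = [['Alice', 'Acme Bhd', 'ok'], ['Bob', 'Beta Bhd', 'ok']]

# pandas route: build a DataFrame, write everything in one call
df = pd.DataFrame(rows, columns=header)
df.to_csv('out_pandas.csv', index=False)

# csv module route: write the header once, then the rows,
# where each row is a plain list of values (not a DataFrame)
with open('out_csv.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)

Both produce the same CSV here; the difference shows up in how the data has to be shaped before writing.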

Here's my script:

import csv
import datetime

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as bs

# `header` and `results` are defined earlier in the full script

def writerows(rows, filename):
    with open(filename, 'a', encoding='UTF8', newline='') as toWrite:
        writer = csv.writer(toWrite)
        writer.writerow(header)
        writer.writerows([rows])

start = datetime.datetime.now()
count = 0
allin = []
for html in results:
    count += 1
    soup = bs(html, 'html.parser')
    try:
        remark = soup.find('td', class_='FootNote').text
    except AttributeError:
        remark = np.nan

    name_tbl = pd.read_html(html, match='Name')[0]
    name = name_tbl.set_index([0, name_tbl.groupby(0).cumcount()])[1].unstack(0)
    com_tbl = pd.read_html(html, match='Company Name')[0]
    comname = com_tbl.set_index([0, com_tbl.groupby(0).cumcount()])[1].unstack(0)

    try:
        adm = pd.read_html(html, match='Admission Sponsor', index_col=0)[0].T
    except ValueError:
        adm = pd.DataFrame({'Admission Sponsor': np.nan, 'Sponsor': np.nan}, index=[0])

    df = name.join(comname).join(adm)

    df['remark'] = remark
    allin.append(df)
    finaldf = pd.concat(allin, ignore_index=True)

#     finaldf.to_csv('testing01.csv', index=0) >>>> to_csv function
#     writerows(finaldf, 'testing.csv') >>>> csv.writer function
    print(count, '|', end=' ')
finish = datetime.datetime.now() - start 
print("Time Taken:",finish)

I was expecting the output to be the same as the to_csv function's output. I'm pretty sure my script is lacking something, which is why it can't achieve the same result as the to_csv function. I would greatly appreciate your guidance and an explanation.

  • Your `writerows()` function is expecting a list of lists to be supplied in `rows`. But you are calling it with a 1-list that contains a dataframe. `csv.writer` does not understand dataframes. Instead of putting the data in a dataframe, you will need to put your scraped data in a list of lists, the way the `csv` module expects. – BoarGules Jul 31 '21 at 15:27
  • @BoarGules do you have any website I can refer to? I'm kind of new to these things, so it will be a bit of a struggle for me to understand. I'd appreciate it if you could share some reference websites or examples – Yazid Yaakub Jul 31 '21 at 15:37
  • A good place to start would be the documentation for the `csv` module: https://docs.python.org/3/library/csv.html – BoarGules Jul 31 '21 at 15:39
  • @BoarGules appreciate it. By the way, is there any reference on how to put my scraped data into a list of lists? If there is, I would like to see an example – Yazid Yaakub Jul 31 '21 at 15:42
  • As for the last question, probably you need this: https://stackoverflow.com/questions/54549284/convert-dataframe-to-2d-array Just a guess: something like this `writerows(finaldf.to_numpy(), 'testing.csv')` – Yuri Khristich Jul 31 '21 at 18:02
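Following up on the comments above: a minimal sketch of the suggested fix would be to write the header once, then hand csv.writer plain rows (e.g. via DataFrame.to_numpy()) instead of a DataFrame wrapped in a list. The function name write_df and the filename below are placeholders, not part of the original script:

import csv
import pandas as pd

def write_df(df, filename):
    # csv.writer understands lists of values, not DataFrames,
    # so convert the frame to a list of rows first
    with open(filename, 'w', encoding='UTF8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(df.columns)
        writer.writerows(df.to_numpy().tolist())

# e.g. write_df(finaldf, 'testing.csv'), called once after the loop finishes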

0 Answers