I'm scraping a website through a list of links, 442 links in total. Each link contains a DataFrame, which I manage to pull using pd.read_html(). I looped over all the links, scraped each DataFrame, and joined them, but after finishing I found that some of the links have the tables in different positions, so I was unable to extract the DataFrame from those pages. How do I fix this problem? Sorry if I haven't explained it clearly, but here's my script:
from bs4 import BeautifulSoup as bs
import pandas as pd

allin = []
for link in titlelink:
    driver.get(link)
    soup = bs(driver.page_source, 'html.parser')
    iframe = soup.find('iframe')['src']
    # open the iframe (driver.get returns None, so no need to keep its result)
    driver.get(iframe)
    iframehtml = driver.page_source
    print('fetching --', link)
    # use pandas read_html to get the tables
    All = pd.read_html(iframehtml)
    # the main table is usually at index 1, but on some pages it is at index 2
    try:
        table1 = All[1].set_index([0, All[1].groupby(0).cumcount()])[1].unstack(0)
    except (IndexError, KeyError):
        table1 = All[2].set_index([0, All[2].groupby(0).cumcount()])[1].unstack(0)
    # reset table2 each iteration so a missing table isn't reused from the previous link
    table2 = None
    try:
        table2 = All[3].set_index([0, All[3].groupby(0).cumcount()])[1].unstack(0)
    except (IndexError, KeyError):
        pass
    df = table1.join(table2) if table2 is not None else table1
    try:
        df['Remarks'] = All[2].iloc[1]
    except (IndexError, KeyError):
        df['Remarks'] = All[3].iloc[1]
    allin.append(df)

finaldf = pd.concat(allin, ignore_index=True)
print(finaldf)
finaldf.to_csv('data.csv', index=False)
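To illustrate the position problem with a minimal, self-contained example (the page layouts and values here are hypothetical, not from the real site): on some pages the table I want sits at index 1 of the read_html list, on others at index 2, so one idea I'm considering is looking the table up by its content instead of by position:

```python
import pandas as pd

# Hypothetical stand-ins for the lists pd.read_html() returns:
# on "page A" the key/value table is at index 1, on "page B" at index 2.
page_a = [
    pd.DataFrame({0: ['header'], 1: ['x']}),
    pd.DataFrame({0: ['Name', 'Date'], 1: ['Alpha', '2021-01-01']}),
]
page_b = [
    pd.DataFrame({0: ['header'], 1: ['x']}),
    pd.DataFrame({0: ['ad'], 1: ['banner']}),
    pd.DataFrame({0: ['Name', 'Date'], 1: ['Beta', '2021-02-02']}),
]

def find_table(tables, marker):
    """Return the first table whose first column contains `marker`,
    instead of relying on a fixed position in the list."""
    for t in tables:
        if t[0].astype(str).str.contains(marker).any():
            return t
    raise ValueError(f'no table containing {marker!r}')

for tables in (page_a, page_b):
    t = find_table(tables, 'Name')
    record = t.set_index(0)[1]   # same key/value reshape as in the loop above
    print(record['Name'])        # prints Alpha, then Beta
```

pd.read_html() also accepts a match= argument that keeps only the tables containing a given text, which might remove the need for positional indexing entirely.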
Also, I've exported all the links to a CSV and attached it here (https://drive.google.com/file/d/1Tk2oKVEZwfxAnHIx3p2HbACE6vOrJq5A/view?usp=sharing), so that you can get a clearer picture of the problem I'm facing. I appreciate all of your help.