
I'm scraping a website through a list of links, 442 links in total. Each link contains a DataFrame, which I manage to pull with pd.read_html(). So I tried to loop over all the links, scrape every DataFrame and join them, but after finishing everything I found out that some of the links have a different DataFrame positioning, and for those I was unable to extract the DataFrame. How do I fix this problem? Sorry if I can't explain it clearly, but here's my script:

import pandas as pd
from bs4 import BeautifulSoup as bs
# the Selenium driver and the titlelink list are set up earlier in the script

allin = []

for link in titlelink :
    driver.get(link)
    html = driver.page_source
    soup = bs(html, 'html.parser')
    iframe = soup.find('iframe')['src']
    #open iframe
    openiframe = driver.get(iframe)
    iframehtml = driver.page_source
    print('fetching --',link)
    # use pandas read_html to get the tables
    All = pd.read_html(iframehtml)
    try :
        table1 = All[1].set_index([0, All[1].groupby(0).cumcount()])[1].unstack(0)
    except :
        table1 = All[2].set_index([0, All[2].groupby(0).cumcount()])[1].unstack(0)
    try :
        table2 = All[3].set_index([0, All[3].groupby(0).cumcount()])[1].unstack(0)
    except :
        pass
    df = table1.join(table2)
    try :
        df['Remarks'] = All[2].iloc[1]
    except :
        df['Remarks'] = All[3].iloc[1]
    allin.append(df)
    
finaldf = pd.concat(allin, ignore_index=True)
print(finaldf)
finaldf.to_csv('data.csv', index=False)

Also, I've exported all the links into a CSV and attached it here (https://drive.google.com/file/d/1Tk2oKVEZwfxAnHIx3p2HbACE6vOrJq5A/view?usp=sharing), so that you can get a clearer picture of the problem I'm facing. I appreciate all of your help.
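
For context on the positioning issue: pd.read_html() returns the tables in page order, so a fixed index like All[1] can point to a different table on a different page. One possible workaround, shown here only as a rough sketch, is to pick a table by a label it contains rather than by its position; the helper find_table and the example label 'Proposed company name' are illustrative assumptions, not part of the script above:

import pandas as pd

def find_table(tables, label):
    # return the first table whose first column mentions the given label,
    # or None if no table matches
    for t in tables:
        if t.iloc[:, 0].astype(str).str.contains(label, na=False).any():
            return t
    return None

# inside the loop, instead of relying on a fixed position:
# All = pd.read_html(iframehtml)
# table1_raw = find_table(All, 'Proposed company name')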

  • You want to merge `Proposed company name`, `Remarks`, `Announcement Info` into one `DataFrame`? – imxitiz Jul 22 '21 at 02:33
  • @Xitiz yes, but some of the links have extra fields, namely "Admission Sponsor" and "Sponsor" – Yazid Yaakub Jul 22 '21 at 02:39
  • And you want to add that too in your final `df`? – imxitiz Jul 22 '21 at 02:41
  • @Xitiz yes, I want to add everything into the final `df`, including the extra fields – Yazid Yaakub Jul 22 '21 at 02:44
  • Does this answer your question? [how to iterate scraping all the table in a list of url?](https://stackoverflow.com/questions/68407407/how-to-iterate-scraping-all-the-table-in-a-list-of-url) – αԋɱҽԃ αмєяιcαη Jul 22 '21 at 05:38
  • @αԋɱҽԃαмєяιcαη that question is from the same person, so why would the OP ask the same question again? You are the one who answered that question, and that answer is already accepted. If it were not accepted, we might think the OP hadn't noticed your answer, but since it is accepted and they are asking a new question, this one is probably not a duplicate of it. – imxitiz Jul 22 '21 at 06:12
  • @αԋɱҽԃαмєяιcαη please don't get me wrong, the previous question had the same fields on both of the links; this question involves 442 links, each with different fields and table positioning. So, with my limited knowledge of Python, I'm seeking a solution. – Yazid Yaakub Jul 22 '21 at 07:16

2 Answers


I found a pattern in the links, so I tried this and it is now working fine. Not 100% perfectly, but it works about 95% of the time. Here's the code:

import pandas as pd
import requests

df=pd.read_csv("link.csv") # That google drive document

links=df["0"].values.tolist()

for link in links:
    nlink=f"https://disclosure.bursamalaysia.com/FileAccess/viewHtml?e={link.split('ann_id=')[1]}"
    page=requests.get(nlink,headers={"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15"})
    df=pd.read_html(page.text)
    df=pd.concat(df[1:],axis=0).to_numpy().flatten()
    
    df=pd.DataFrame(df[~pd.isna(df)].reshape(-1,2))
    # For explanation about these last two line you may check here https://stackoverflow.com/questions/68479177/how-to-shift-a-dataframe-element-wise-to-fill-nans
    print(df)

You will have to change most of it to suit your needs. If you need any help, you can ask in a comment. It's not technically a complete answer, but it speeds up your scraping and gives roughly the output you want, though not completely. Take it as a suggestion and an idea that solves the issue 90% to 95%.
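
To make the last two lines inside the loop concrete, here is a tiny self-contained sketch of the same flatten / drop-NaN / reshape step on made-up data (the labels and values are purely illustrative):

import numpy as np
import pandas as pd

# a toy two-column table the way pd.read_html sometimes returns it,
# with an entirely empty spacer row in the middle
raw = pd.DataFrame([["Admission Sponsor", "ABC Bank"],
                    [np.nan, np.nan],
                    ["Sponsor", "XYZ Capital"]])

flat = raw.to_numpy().flatten()                              # 1-D array of all cells
pairs = pd.DataFrame(flat[~pd.isna(flat)].reshape(-1, 2))    # drop NaNs, refold into (label, value) pairs
print(pairs)
#                    0            1
# 0  Admission Sponsor     ABC Bank
# 1            Sponsor  XYZ Capital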

imxitiz

After some trial and error, I finally got the answer to my own question. Here's the script:

import time
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as bs
# the Selenium driver and the titlelink list are set up earlier, as in the question

frame = []

for link in titlelink :
    time.sleep(1)
    driver.get(link)
    html = driver.page_source
    soup = bs(html, 'html.parser')
    iframe = soup.find('iframe')['src']
    #open iframe
    openiframe = driver.get(iframe)
    iframehtml = driver.page_source
    print('fetching --',link)
    # use pandas read_html to get the tables
    df_proposed_company_name = pd.read_html(iframehtml,  match='Proposed company name')[0]
    df_announcement_info = pd.read_html(iframehtml, match='Stock Name ')[0]
   
    try:
        df_remarks = pd.read_html(iframehtml, match='Remarks :')[0].iloc[1]
    except:
        pass
    
    try : 
        df_Admission_Sponsor = pd.read_html(iframehtml, match='Admission Sponsor')[1]
    except :
        pass

    
    try:
        t1_1 = df_Admission_Sponsor.set_index([0,df_Admission_Sponsor.groupby(0).cumcount()])[1].unstack(0)
    except:
        t1_1 = pd.DataFrame({'Admission Sponsor':np.nan,
                           'Sponsor':np.nan},index=[0])
    t1_2 = df_proposed_company_name.set_index([0, df_proposed_company_name.groupby(0).cumcount()])[1].unstack(0)
    t3 = df_announcement_info.set_index([0, df_announcement_info.groupby(0).cumcount()])[1].unstack(0)

    dfs = t1_1.join(t1_2).join(t3)
    try:
        dfs['remarks'] = df_remarks
    except:
        dfs['remarks'] = np.nan
        
    frame.append(dfs)
    
finaldf = pd.concat(frame, ignore_index=True)
# print(finaldf)
finaldf.to_csv('data.csv', index=False)
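
For anyone puzzled by the repeated set_index([0, ...cumcount()])[1].unstack(0) pattern above, here is a minimal sketch of what it does to a two-column key/value table (the labels and values are made up for illustration):

import pandas as pd

# a toy key/value table of the kind pd.read_html returns:
# column 0 holds the labels, column 1 holds the values
kv = pd.DataFrame({0: ['Stock Name', 'Date Announced', 'Category'],
                   1: ['XYZ', '22 Jul 2021', 'General Announcement']})

# cumcount() numbers any duplicate labels so they don't collide,
# and unstack(0) pivots the labels into column headers, one row per table
row = kv.set_index([0, kv.groupby(0).cumcount()])[1].unstack(0)
print(row)   # a single row whose columns are 'Stock Name', 'Date Announced' and 'Category'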

If any of you have more advanced experience or better solutions, I'm open to them and happy to learn new things from you :-)