
With the help of @JaSON, here's code that reads the table data from a local HTML file using Selenium:

from selenium import webdriver

driver = webdriver.Chrome("C:/chromedriver.exe")
driver.get('file:///C:/Users/Future/Desktop/local.html')

# Each row of the table starts with a div whose id is "Section3"
counter = len(driver.find_elements_by_id("Section3"))
# Select the divs lying between the i-th and (i+1)-th "Section3" markers
xpath = "//div[@id='Section3']/following-sibling::div[count(preceding-sibling::div[@id='Section3'])={0} and count(following-sibling::div[@id='Section3'])={1}]"
print(counter)

for i in range(counter):
    print('\nRow #{} \n'.format(i + 1))
    _xpath = xpath.format(i + 1, counter - (i + 1))
    cells = driver.find_elements_by_xpath(_xpath)
    for cell in cells:
        value = cell.find_element_by_xpath(".//td").text
        print(value)

How can these rows be converted into a valid table that I can export to a CSV file? Here's the local HTML link: https://pastebin.com/raw/hEq8K75C
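For the CSV side on its own, a minimal sketch with the standard `csv` module would look like this, assuming the cell texts of each row have already been collected into a list of lists (the `rows` name and the sample values below are made up for illustration):

```python
import csv

# Hypothetical sample data standing in for the scraped cell texts
rows = [
    ["2020-12-01", "Cairo", "100"],
    ["2020-12-02", "Giza", "250"],
]

# newline="" avoids blank lines between rows on Windows
with open("table.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(rows)
```

Each inner list becomes one comma-separated line in `table.csv`.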

** @Paul Brennan: After changing counter to counter - 1, I got 17 rows, which temporarily skips the error on row 18, and the filename.txt file was produced (snapshot of the output attached).

YasserKhalil
    https://stackoverflow.com/questions/45394374/trying-to-scrape-table-using-pandas-from-seleniums-result this will answer your problem. I could not tailor it to your solution as we cannot see your local HTML. – Paul Brennan Dec 08 '20 at 11:36
    I have updated the post and attached the HTML link. – YasserKhalil Dec 08 '20 at 11:42

2 Answers


I have modified your code to do a simple output. This is not very pythonic, as it does not use vectorized creation of the DataFrame, but here is how it works: first import pandas, then create an empty DataFrame (we don't know the columns yet), then set the columns on the first pass (this will cause problems if the rows have variable lengths), then write the values into the DataFrame.

import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome("C:/chromedriver.exe")
driver.get('file:///C:/Users/Future/Desktop/local.html')
counter = len(driver.find_elements_by_id("Section3"))
xpath = "//div[@id='Section3']/following-sibling::div[count(preceding-sibling::div[@id='Section3'])={0} and count(following-sibling::div[@id='Section3'])={1}]"
print(counter)

df = pd.DataFrame()  # note the capital F: pd.Dataframe() raises AttributeError

for i in range(counter):
    print('\nRow #{} \n'.format(i + 1))
    _xpath = xpath.format(i + 1, counter - (i + 1))
    cells = driver.find_elements_by_xpath(_xpath)
    if i == 0:
        df = pd.DataFrame(columns=cells)  # fill the dataframe with the column names
    for cell in cells:
        value = cell.find_element_by_xpath(".//td").text
        #print(value)
        if value:  # only keep non-empty strings
            # always putting the value in the first column
            df.at[i, 0] = value  # put the value in the frame

df.to_csv('filename.txt')  # output the dataframe to a file

This could be made better by putting the items of each row into a dictionary and building the DataFrame from those, but I am writing this on my phone, so I cannot test that.
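A minimal sketch of that dictionary idea (not tested against the real page; the column names and sample values below are made up for illustration) collects one dict per row and builds the DataFrame in a single vectorized call instead of cell-by-cell `df.at[...]` writes:

```python
import pandas as pd

# Hypothetical rows standing in for the scraped cell texts
rows = [
    {"date": "2020-12-01", "city": "Cairo", "amount": "100"},
    {"date": "2020-12-02", "city": "Giza", "amount": "250"},
]

# One construction call; dict keys become the column names
df = pd.DataFrame(rows)
df.to_csv("filename.csv", index=False)
```

Missing keys in any row would simply become NaN in that column, so this also tolerates rows of variable length.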

Paul Brennan
  • Thanks a lot for the great help. After printing the data from row 18, I got the error `Message: no such element: Unable to locate element: {"method":"xpath","selector":".//td"}`. The error refers to the line `value = cell.find_element_by_xpath(".//td").text`, and no file is exported. As for the columns, there are 10 (you can have a look at the local HTML file in the browser). – YasserKhalil Dec 08 '20 at 14:52
  • I have skipped row 18 to get the filename as output. Attached snapshot in the main post. – YasserKhalil Dec 08 '20 at 15:10
  • @QHarr I am sure you have experience in this field. – YasserKhalil Dec 09 '20 at 05:45
  • How about we skip the blank lines... That will get the comma out. – Paul Brennan Dec 11 '20 at 03:31
  • I have used `try.. except` and put `break` in the `except` branch. This fixes the error, and now I can get all the data in the same order of rows. How can I get the data into a DataFrame laid out like the table on the webpage? I noticed the values are not in order, and I need to skip blank lines too. – YasserKhalil Dec 11 '20 at 03:35
  • I got a df with 18 rows, which is OK, but 180 columns (very weird, as there should be only 10). I also noticed the code takes too long. Isn't there a faster approach? – YasserKhalil Dec 11 '20 at 03:38

With the great help of @Paul Brennan, I could modify the code to get the final desired output:

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome("C:/chromedriver.exe")
driver.get('file:///C:/Users/Future/Desktop/local.html')
counter = len(driver.find_elements_by_id("Section3"))
xpath = "//div[@id='Section3']/following-sibling::div[count(preceding-sibling::div[@id='Section3'])={0} and count(following-sibling::div[@id='Section3'])={1}]"
finallist = []

for i in range(counter):
    rowlist = []
    _xpath = xpath.format(i + 1, counter - (i + 1))
    cells = driver.find_elements_by_xpath(_xpath)
    for cell in cells:
        try:
            value = cell.find_element_by_xpath(".//td").text
            rowlist.append(value)
        except NoSuchElementException:
            break
    finallist.append(rowlist)

df = pd.DataFrame(finallist)
# reorder the columns to match the table on the page
# (assign the result back, or the reorder is silently lost)
df = df[df.columns[[2, 0, 1, 7, 9, 8, 3, 5, 6, 4]]]

The code works well now but it is too slow. Is there a way to make it faster?
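One likely reason it is slow is that every `find_element` call is a round trip to the browser. A sketch of a faster approach (not tested against the real page) is to grab `driver.page_source` once and run the same XPath locally with `lxml`. The tiny HTML document below is a made-up stand-in for the real page structure:

```python
from lxml import html

# With Selenium you would do: tree = html.fromstring(driver.page_source)
# Here a made-up document mimics the assumed div/td structure.
page = """
<html><body>
  <div id="Section3"></div>
  <div><table><tr><td>Cairo</td></tr></table></div>
  <div><table><tr><td>100</td></tr></table></div>
  <div id="Section3"></div>
  <div><table><tr><td>Giza</td></tr></table></div>
</body></html>
"""
tree = html.fromstring(page)

# Same row-marker logic as the Selenium version, evaluated in-process
counter = len(tree.xpath("//div[@id='Section3']"))
xpath = ("//div[@id='Section3']/following-sibling::div"
         "[count(preceding-sibling::div[@id='Section3'])={0}"
         " and count(following-sibling::div[@id='Section3'])={1}]")

rows = []
for i in range(counter):
    cells = tree.xpath(xpath.format(i + 1, counter - (i + 1)))
    rows.append([cell.findtext(".//td") for cell in cells])
```

`rows` ends up as one list of cell texts per table row, ready to feed to `pd.DataFrame(rows)`, with no per-cell browser round trips.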

YasserKhalil