
I am using Selenium to extract data from the HTML body of a webpage and am writing the data to a .csv file using pandas.

The data is extracted and written to the file, but I would like to control the formatting so that the data is written to specified columns. After reading many threads and docs, I still don't understand how to do this.

The current CSV file output is as follows, with all the data in one column:

0,
B09KBFH6HM,
dropdownAvailable,
90,
1,
B09KBNJ4F1,
dropdownAvailable,
100,
2,
B09KBPFPCL,
dropdownAvailable,
110

or, if I use the [count] count += 1 method, it is all in one row:

0,B09KBFH6HM,dropdownAvailable,90,1,B09KBNJ4F1,dropdownAvailable,100,2,B09KBPFPCL,dropdownAvailable,110

I would like the output to be formatted as follows,

/col1 /col2      /col3             /col4 
0,   B09KBFH6HM, dropdownAvailable, 90, 
1,   B09KBNJ4F1, dropdownAvailable, 100,    
2,   B09KBPFPCL, dropdownAvailable, 110

I have tried using the columns= option but get errors in the terminal, and I don't understand from the append docs which feature I should be using to achieve this:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html?highlight=append#pandas.DataFrame.append

A simplified version is as follows

from selenium import webdriver
import pandas as pd

price = []

driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.example.co.jp/dp/zzzzzzzzzz/")


select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]
for element in options:
    price.append(element.get_attribute("value"))
    price.append(element.get_attribute("class"))
    price.append(element.get_attribute("data-a-html-content"))


output = pd.DataFrame(price)
output.to_csv("Data.csv", encoding='utf-8-sig')

driver.close()

Do I need to parse each item separately and append? I would like each of the .get_attribute values to be written to a new column.

Is there any advice you can offer for a solution to this? I am not very proficient with pandas. Thank you for your help.

crawf

3 Answers

1

Adding all your items to the price list is going to cause them all to be in one column. Instead, store separate lists for each column, in a dict, like this (name them whatever you want):

data = {
    'values': [],
    'classes': [],
    'data_a_html_contents': [],
}

...

for element in options:
    # The lists live inside the dict, so append through data[...]
    data['values'].append(element.get_attribute("value"))
    data['classes'].append(element.get_attribute("class"))
    data['data_a_html_contents'].append(element.get_attribute("data-a-html-content"))

...

output = pd.DataFrame(data)
output.to_csv("Data.csv", encoding='utf-8-sig')
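
As a minimal sketch of the dict-of-lists idea that runs without a browser, the snippet below uses sample values from the question in place of the `element.get_attribute(...)` results. The `rows` list is a hypothetical stand-in for the scraped options. Note that the lists live inside `data`, so they are reached as `data["values"]` rather than as a bare name `values`.

```python
import pandas as pd

# Hypothetical stand-in for the scraped <option> attributes (sample values from the question)
rows = [
    ("B09KBFH6HM", "dropdownAvailable", "90"),
    ("B09KBNJ4F1", "dropdownAvailable", "100"),
    ("B09KBPFPCL", "dropdownAvailable", "110"),
]

# One list per column, keyed by the column name
data = {"values": [], "classes": [], "data_a_html_contents": []}

for value, css_class, content in rows:
    data["values"].append(value)
    data["classes"].append(css_class)
    data["data_a_html_contents"].append(content)

# Each dict key becomes a column, each list position becomes a row
output = pd.DataFrame(data)
print(output)
```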
  • Thank you for this informative reply. You have confirmed the issue: I was appending everything to one list. I really like the simple solution of using lists and will learn more about this method. I am not able to implement your solution, though. Although I tried many times, I always get the error `values.append(element.get_attribute("value")) NameError: name 'values' is not defined`. I tried several variations and places in the script but was unable to get it working. It is really a shame I couldn't get it to work (due to my own lack of knowledge). Thank you for the reply – crawf Nov 28 '21 at 00:51
1

An approach similar to @user17242583's, but a little shorter:

data = [[e.get_attribute("value"), e.get_attribute("class"), e.get_attribute("data-a-html-content")] for e in options]

df = pd.DataFrame(data, columns=['ASIN', 'dropdownAvailable', 'size'])  # the third column may be the product size
df.to_csv("Data.csv", encoding='utf-8-sig')
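
To check what the CSV will look like before touching a real file, one option is to write it to an in-memory buffer. This is only a sketch: the `data` rows below are hypothetical sample values standing in for the list comprehension over `options`.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the scraped options
data = [
    ["B09KBFH6HM", "dropdownAvailable", "90"],
    ["B09KBNJ4F1", "dropdownAvailable", "100"],
    ["B09KBPFPCL", "dropdownAvailable", "110"],
]

df = pd.DataFrame(data, columns=["ASIN", "dropdownAvailable", "size"])

# Write to an in-memory text buffer instead of Data.csv
buffer = io.StringIO()
df.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

By default the row index (0, 1, 2) is written as the first, unnamed column, which matches the layout asked for in the question. Passing `index=False` to `to_csv` leaves it out.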
user11717481
  • Thank you for your feedback on this problem. Although your answer is the one I understand least due to the syntax ordering (I will do further research to learn more about this style), it is the one that worked best for me, literally straight out of the box, with clear columns and ordered results as desired. Thank you very much for taking the time to assist me, it is really appreciated and I can now continue with the project. – crawf Nov 28 '21 at 00:42
1

You were collecting the value, class and data-a-html-content and appending them to the same list price. Hence, the list becomes:

price = [value1, class1, data-a-html-content1, value2, class2, data-a-html-content2, ...]

Hence, within the dataframe every value ends up in a single column, one value per row.
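
To make this concrete, here is a small sketch with sample values from the question. It shows the single-column result of the flat list and one possible way to regroup such a list into rows of three. The name `fields_per_option` is a hypothetical helper, and the regrouping assumes exactly three attributes were appended per option, in a fixed order.

```python
import pandas as pd

# Flat list, as built by appending value, class and content one after another
price = [
    "B09KBFH6HM", "dropdownAvailable", "90",
    "B09KBNJ4F1", "dropdownAvailable", "100",
    "B09KBPFPCL", "dropdownAvailable", "110",
]

# A flat list becomes a single column with one row per item
flat = pd.DataFrame(price)
print(flat.shape)

# Regroup the flat list into chunks of three items, one chunk per option
fields_per_option = 3
rows = [price[i:i + fields_per_option] for i in range(0, len(price), fields_per_option)]

table = pd.DataFrame(rows, columns=["Value", "Class", "Data-a-Html-Content"])
print(table)
```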


Solution

To get value, class and data-a-html-content in separate columns you can adopt either of the two approaches below:

  • Pass a dictionary to the dataframe.
  • Pass a list of lists to the dataframe.

While @user17242583 and @h.devillefletcher suggest a dictionary, you can achieve the same result using a list of lists as follows:

from selenium import webdriver
import pandas as pd

# One list per output column
values = []
classes = []
data_a_html_contents = []  # hyphens are not valid in Python identifiers

driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.example.co.jp/dp/zzzzzzzzzz/")


select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]
for element in options:
    values.append(element.get_attribute("value"))
    classes.append(element.get_attribute("class"))
    data_a_html_contents.append(element.get_attribute("data-a-html-content"))

# zip() pairs the three lists up row by row, so each option becomes one row
df = pd.DataFrame(data=list(zip(values, classes, data_a_html_contents)), columns=['Value', 'Class', 'Data-a-Html-Content'])
df.to_csv("Data.csv", encoding='utf-8-sig')
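
The role of `zip` here can be seen in a small sketch with sample values from the question (the three lists below are hypothetical stand-ins for the scraped attributes). `zip` turns three column lists into row tuples, whereas passing the lists directly makes each list a row.

```python
import pandas as pd

# Hypothetical stand-ins for the three scraped lists
values = ["B09KBFH6HM", "B09KBNJ4F1", "B09KBPFPCL"]
classes = ["dropdownAvailable", "dropdownAvailable", "dropdownAvailable"]
contents = ["90", "100", "110"]

# zip() yields one tuple per option: (value, class, content)
rows = list(zip(values, classes, contents))
df = pd.DataFrame(rows, columns=["Value", "Class", "Data-a-Html-Content"])
print(df)

# Without zip(), each list becomes a row instead of a column
by_list = pd.DataFrame([values, classes, contents])
print(by_list)
```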

References

You can find a couple of relevant detailed discussions in:

undetected Selenium
  • Thank you @DebanjanB for this very informative reply and resources, I will continue to learn thanks to your advice and clarification, the output writes each data list as rows as seen in this image [link](https://i.imgur.com/QSRTstV.png) I tried to add the columns (3 column headers) as seen in the dataframe link you posted but get the following error `ValueError: 3 columns passed, passed data had 1 columns` could that be due to the comma separator not being recognized? Thank you for helping me learn with this project! – crawf Nov 28 '21 at 01:33
  • @crawf There was a small bug in my code which I have addressed. Can you please update me if the current code works for you? – undetected Selenium Nov 29 '21 at 19:29