
I'm trying to clean a table scraped from a website.

I have two questions:

  1. Why does my code below produce a list of lists instead of a single flat list?
  2. I'm scraping each column into its own list and then converting the lists into a dataframe. Is it better practice to clean the data while it is still in the lists, or to clean it after converting to a dataframe?
doc_name = driver.find_elements(By.XPATH, "//*[@id='docflow.list_DocFlowList']/tbody/tr/td/table/tbody/tr/td[3]")

doc_name_cleaned = [re.findall(r'\d+',i.text) for i in doc_name]
Wiktor Stribiżew
Nico

1 Answer

doc_name_cleaned = [re.findall(r'\d+',i.text) for i in doc_name]

In the line above, re.findall() returns a list of all the matches it finds in a single string (there can be several, or none). Because the comprehension calls it once for each element's text, the result is a list of lists.
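To see this without a browser session, here is a minimal sketch that uses plain strings in place of the scraped elements. The name `cell_texts` and its sample values are made up for illustration and stand in for the `.text` of each element in `doc_name`.

```python
import re

# Hypothetical stand-ins for the .text of each scraped cell.
cell_texts = ["Doc 123", "Ref 45 / 678", "no digits"]

# One findall call per string, so one inner list per string.
per_cell = [re.findall(r"\d+", text) for text in cell_texts]

print(per_cell)  # [['123'], ['45', '678'], []]
```

Each inner list holds every run of digits found in one cell, which is why a cell with two numbers yields two entries and a cell with none yields an empty list.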

If you only want the first match from each element, you can try this:

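doc_name_cleaned = []
for i in doc_name:
    # findall returns every run of digits in this cell's text
    matches = re.findall(r'\d+', i.text)
    if matches:
        # keep only the first match
        doc_name_cleaned.append(matches[0])
    else:
        # no digits found: add a placeholder so the list stays aligned with the rows
        doc_name_cleaned.append('')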
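For comparison, the same result can also be written as a list comprehension. This is only a sketch under the same assumption as the loop (first match per cell, empty string when there is none), and `cell_texts` is again a hypothetical stand-in for the text of the scraped elements. The second comprehension shows a different option: flattening every match into one list, which loses the one-entry-per-row alignment.

```python
import re

# Hypothetical stand-ins for the .text of each scraped cell.
cell_texts = ["Doc 123", "Ref 45 / 678", "no digits"]

# First match per cell; `or ['']` supplies a placeholder when findall returns [].
first_matches = [(re.findall(r"\d+", text) or [""])[0] for text in cell_texts]

# Every match from every cell in a single flat list (rows no longer line up).
all_matches = [num for text in cell_texts for num in re.findall(r"\d+", text)]

print(first_matches)  # ['123', '45', '']
print(all_matches)    # ['123', '45', '678']
```

Which form to prefer is mostly a readability choice. The explicit loop makes the "no match" branch more visible, while the comprehension is shorter.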

Amith Lakkakula
  • Thank you! I was considering this `for` loop option but was wondering if a list comprehension would be a good way to do what I wanted to do. Guess not. – Nico Oct 23 '20 at 14:48