Clean Dates from scraping

Question

I am scraping an HTML table, and the resulting dataframe has a `Date` column that needs cleaning and formatting.

My goal is to convert this column to a date column.

Below is my dataframe:

All I want to do after this step is clean up the `Date` column and convert it to a pandas datetime column.

Any help?

Here is how to produce this table:

# Web scraping
import requests
import lxml.html as lh
import pandas as pd

url='https://markets.ft.com/data/funds/tearsheet/historical?s=LU0841585341:GBP'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)

#Store the contents of the website under doc
doc = lh.fromstring(page.content)

#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Create empty list
col=[]
#For each cell of the first row (the header), store its name and an empty list
for t in tr_elements[0]:
    name=t.text_content()
    col.append((name,[]))

#Since our first row is the header, data is stored from the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 6, the //tr data is not from our table
    if len(T)!=6:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Leave the first column (the date) as text
        if i>0:
        #Convert any integer-like value to int; other values stay as strings
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
df.head()

Please [do not post images](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question) of your data or errors. You can include [code that creates a dataframe such as `df.to_dict()` or the output of `print(df)`](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) (include at least the few rows and columns that allow to reproduce your issue) — Cimbali, Sep 11 '21 at 09:32

score 1 · Accepted Answer · answered Sep 11 '21 at 09:37

You can do:

df["Date"] = pd.to_datetime(
    df["Date"].str.replace(r"(\d+)([A-Z].*)", r"\1", regex=True)
)
print(df)

Prints:

         Date   Open   High    Low  Close Volume
0  2021-09-10  27.28  27.28  27.28  27.28   ----
1  2021-09-09  27.35  27.35  27.35  27.35   ----
2  2021-09-08  27.42  27.42  27.42  27.42   ----
3  2021-09-07  27.54  27.54  27.54  27.54   ----
4  2021-09-03  27.44  27.44  27.44  27.44   ----
5  2021-09-02  27.48  27.48  27.48  27.48   ----
6  2021-09-01  27.26  27.26  27.26  27.26   ----
7  2021-08-31  27.31  27.31  27.31  27.31   ----
8  2021-08-30  27.46  27.46  27.46  27.46   ----
9  2021-08-27  27.32  27.32  27.32  27.32   ----
10 2021-08-26  27.23  27.23  27.23  27.23   ----
11 2021-08-25  27.27  27.27  27.27  27.27   ----
12 2021-08-24  27.22  27.22  27.22  27.22   ----
13 2021-08-23  27.05  27.05  27.05  27.05   ----
14 2021-08-20  26.92  26.92  26.92  26.92   ----
15 2021-08-19  26.58  26.58  26.58  26.58   ----
16 2021-08-18  26.62  26.62  26.62  26.62   ----
17 2021-08-17  26.63  26.63  26.63  26.63   ----
18 2021-08-16  26.56  26.56  26.56  26.56   ----
19 2021-08-13  26.77  26.77  26.77  26.77   ----
20 2021-08-12  26.67  26.67  26.67  26.67   ----
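
For anyone who wants to try the regex without hitting the website, here is a small offline sketch. The first sample string is the raw value quoted in the other answer; the other two are assumed to follow the same "long date + short date" pattern and are made up for illustration. The explicit `format` argument is optional and is only there to make the parse unambiguous.

```python
import pandas as pd

# Sample raw values. The first is taken from this thread; the other two
# are assumptions that mimic the same "long date + short date" layout.
raw = pd.Series(
    [
        "September 10, 2021Fri, Sep 10, 2021",
        "September 09, 2021Thu, Sep 09, 2021",
        "August 31, 2021Tue, Aug 31, 2021",
    ],
    name="Date",
)

# (\d+) captures the year, because it is the first run of digits that is
# directly followed by an uppercase letter; ([A-Z].*) swallows the
# duplicated short date. Only the captured digits are kept.
cleaned = raw.str.replace(r"(\d+)([A-Z].*)", r"\1", regex=True)

# Parse the remaining "Month DD, YYYY" text into datetime64 values.
dates = pd.to_datetime(cleaned, format="%B %d, %Y")

print(cleaned.tolist())
print(dates)
```

The regex works here because the day number is followed by a comma, while the year is immediately followed by the weekday abbreviation, so the first match always starts at the year.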

score 0 · Answer 2 · answered Sep 11 '21 at 09:35

You can convert the string to a datetime like this:

from datetime import datetime 

d='September 10, 2021Fri, Sep 10, 2021'
print(datetime.strptime(''.join(d.split(',')[-2:]), ' %b %d %Y'))

output: 2021-09-10 00:00:00

The steps in the code above are:
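
A breakdown of the one-liner, reconstructed from the code itself, might look like the sketch below. The helper `parse_scraped_date` and the second sample value are hypothetical and only illustrate how the same steps could be applied to a whole column.

```python
from datetime import datetime

import pandas as pd

d = "September 10, 2021Fri, Sep 10, 2021"

# 1. Split on commas.
parts = d.split(",")   # ['September 10', ' 2021Fri', ' Sep 10', ' 2021']

# 2. Keep the last two pieces, which hold the short date.
tail = parts[-2:]      # [' Sep 10', ' 2021']

# 3. Join them back into one string (note the leading space).
joined = "".join(tail)  # ' Sep 10 2021'

# 4. Parse with a format that matches the leading space as well.
parsed = datetime.strptime(joined, " %b %d %Y")
print(parsed)


def parse_scraped_date(text):
    """Hypothetical helper: the same four steps wrapped in a function."""
    return datetime.strptime("".join(text.split(",")[-2:]), " %b %d %Y")


# Applying the helper to a column; the second value is an assumed sample.
raw = pd.Series(
    ["September 10, 2021Fri, Sep 10, 2021", "August 31, 2021Tue, Aug 31, 2021"]
)
dates = pd.to_datetime(raw.map(parse_scraped_date))
print(dates)
```

This relies on the short date always being the last two comma-separated pieces, so it would need adjusting if the scraped text has a different layout.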
