-2

I have a link to the website with table like the follow: https://www.timeanddate.com/holidays/kenya/2022

How can I:

  1. read this table as DataFrame in Jupyter Notebook in Python ?
  2. Convert column "Date" so as to have date format like "01.01.2022" not as exists on website "1 sty"
  3. how to create column "Day" where will be value like: sobota, niedziela and so on which currently are between columns "Date" and "Name" ?

So, as a result I need something like below:

Date Day Name Type
01.01.2022 sobota New Year's Day Public holiday
20.03.2022 niedziela March Equinox Season
... ... ... ...

How can I do that in Python Pandas ?

dingaro
  • 2,156
  • 9
  • 29
  • show us the code attempts you did so far? where you are unable to move forward? – simpleApp Dec 25 '22 at 13:22
  • I tried many solutions from internet, which did not work :/ – dingaro Dec 25 '22 at 13:32
  • pls add your most recent code and also specify what's not working, so we can suggest the tweak you would need. – simpleApp Dec 25 '22 at 13:38
  • 1
    for example, this is the starter code and then fix each need 1 by 1. `import pandas as pd df=pd.read_html("https://www.timeanddate.com/holidays/kenya/2022",attrs = {'id': 'holidays-table'}) df[0]=df[0].dropna(axis = 0, how = 'all') df[0].head(20)` – simpleApp Dec 25 '22 at 13:44
  • simpleApp, when I use your code I have error like that: AttributeError: module 'html5lib.treebuilders' has no attribute '_base' – dingaro Dec 25 '22 at 13:45
  • 2
    you have something else going on vs pandas error. [solve error](https://stackoverflow.com/questions/38447738/beautifulsoup-html5lib-module-object-has-no-attribute-base) pls look at these reference to fix the error. most likely you would need this `pip install --upgrade beautifulsoup4 pip install --upgrade html5lib` – simpleApp Dec 25 '22 at 13:50
  • simpleApp it is ok, but I still do not have idea how to convert date column to usefull form :/ – dingaro Dec 25 '22 at 14:05

2 Answers2

0

You can do this thanks to beautifulsoup library ... If you click right in the web page with google chrome, you can see the structure of the web page, it's well structured, and easy to extract data between html tags. Also, if you want to extract data of all years, just loop on the web url.

https://www.timeanddate.com/holidays/kenya/2022 https://www.timeanddate.com/holidays/kenya/2021 ...

notafk
  • 11
  • 3
  • could you present sample of code with loop ? – dingaro Dec 25 '22 at 13:44
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Dec 26 '22 at 02:07
0

read this table as DataFrame

You can probably just use pandas.read_html directly.

# import pandas
khdf = pandas.read_html('https://www.timeanddate.com/holidays/kenya/2022')[0]

and to clean up a bit by resetting the column headers and getting rid of empty rows:

khdf = khdf.set_axis(
    ['Date', 'Day', 'Name', 'Type'], axis='columns'
).dropna(axis='rows', how='all')

Convert column "Date" so as to have date format like "01.01.2022"

You can parse the date with dateutil.parser and then format it with .strftime.

# from dateutil.parser import parse as duParse
y = 2022
khdf['Date'] = [duParse(f'{y} {d}').strftime('%d.%m.%Y') for d in khdf['Date']]

how to create column "Day" where will be value like: sobota, niedziela and so on

As it is so far, we already have a Day column with Monday/Tuesday/etc., but if you want them in Polish, you could use a translation dictionary [like daysDict below].

daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}
khdf['Day'] = [daysDict[d] if d in daysDict else d for d in khdf['Day']]

If you want to translate everything [except for Date], you could use the googletrans module. (I think the version installed by default has some issues, but 3.1.0a0 works for me.)

# !pip install googletrans==3.1.0a0
# from googletrans import Translator
translator = Translator()
for c in ['Day', 'Name', 'Type']:
    khdf[c] = [translator.translate(d, src='en', dest='pl').text for d in khdf[c]]


[because you commented about] "sample of code with loop"

Since the page links have a consistent format, you can loop through various countries and years.

First, import the necessary libraries and define the translation dictionary along with a function that tries to parse and format the date (but returns a null value (None) if it fails):

import pandas
from dateutil.parser import parse as duParse

daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}

def try_dup(dStr, yr):
    try: return duParse(f'{yr} {dStr}').strftime('%d.%m.%Y') 
    except: return None

then, set the start and end years as well as a list of countries:

startYear, endYear = 2010, 2030
countryList = ['kenya', 'tonga', 'belgium']

now, we're ready to loop though the countries and years to collect data:

dfList = []
for country in countryList:
    for y in range(startYear, endYear+1):
        try: 
            cyUrl = f'https://www.timeanddate.com/holidays/{country}/{y}'
            cydf = pandas.read_html(cyUrl)[0]

            cydf = cydf.drop(# only the first 4 columns are kept
                [c for c in cydf.columns[4:]], axis='columns'
            ).set_axis(['Date', 'Day', 'Name', 'Type'], axis='columns')
            cydf['Date'] = [try_dup(d, y) for d in cydf['Date']] # parse+format date
            cydf['Country'] = country.capitalize() # add+fill a column with country name

            dfList.append(cydf.dropna(axis='rows', subset=['Date'])) # only add rows with Date
            # print('', end=f'\r{len(dfList[-1])} holidays scraped from {cyUrl}')
        # except: continue ## skip without printing error
        except Exception as e: 
            print('\n', type(e), e, '- failed to scrape from', cyUrl)
# print('\n\n', len(dfList), 'dataframes with', sum([len(d) for d in dfList]),'holidays scraped overall')

After looping, all the DataFrames can be combined into one before translating the days:

acydf = pandas.concat(dfList, ignore_index=True)
acydf['Day'] = [daysDict[d] if d in daysDict else d for d in acydf['Day']] # translate days
acydf = acydf[['Country', 'Date', 'Day', 'Name', 'Type']] # rearrange columns

A sample of acydf [printed with print(acydf.loc[::66].to_markdown(index=False))]:

| Country   | Date       | Day          | Name                                          | Type                        |
|:----------|:-----------|:-------------|:----------------------------------------------|:----------------------------|
| Kenya     | 01.01.2012 | Niedziela    | New Year's Day                                | Public holiday              |
| Kenya     | 19.07.2015 | Niedziela    | Eid al-Fitr                                   | Public holiday              |
| Kenya     | 10.10.2018 | Środa        | Moi Day                                       | Public holiday              |
| Kenya     | 10.10.2021 | Niedziela    | Huduma Day                                    | Public holiday              |
| Kenya     | 26.12.2023 | Wtorek       | Boxing Day                                    | Public holiday              |
| Kenya     | 01.01.2027 | Piątek       | New Year's Day                                | Public holiday              |
| Kenya     | 14.04.2030 | Niedziela    | Eid al-Adha (Tentative Date)                  | Optional Holiday            |
| Tonga     | 17.09.2012 | Poniedziałek | Birthday of Crown Prince Tupouto'a-'Ulukalala | Public Holiday              |
| Tonga     | 25.04.2016 | Poniedziałek | ANZAC Day                                     | Public Holiday              |
| Tonga     | 04.12.2019 | Środa        | Anniversary of the Coronation of King Tupou I | Public Holiday              |
| Tonga     | 04.06.2023 | Niedziela    | Emancipation Day                              | Public Holiday              |
| Tonga     | 01.01.2027 | Piątek       | New Year's Day                                | Public Holiday              |
| Tonga     | 04.11.2030 | Poniedziałek | Constitution Day                              | Public Holiday              |
| Belgium   | 06.12.2011 | Wtorek       | St. Nicholas Day                              | Observance                  |
| Belgium   | 06.12.2013 | Piątek       | St. Nicholas Day                              | Observance                  |
| Belgium   | 06.12.2015 | Niedziela    | St. Nicholas Day                              | Observance                  |
| Belgium   | 15.11.2017 | Środa        | Day of the German-speaking Community          | Regional government holiday |
| Belgium   | 01.11.2019 | Piątek       | All Saints' Day                               | National holiday            |
| Belgium   | 31.10.2021 | Niedziela    | Halloween                                     | Observance                  |
| Belgium   | 23.09.2023 | Sobota       | September Equinox                             | Season                      |
| Belgium   | 15.08.2025 | Piątek       | Assumption of Mary                            | National holiday            |
| Belgium   | 11.07.2027 | Niedziela    | Day of the Flemish Community                  | Regional government holiday |
| Belgium   | 10.06.2029 | Niedziela    | Father's Day                                  | Observance                  |
Driftr95
  • 4,572
  • 2
  • 9
  • 21