I am extracting an HTML table from the web with pandas. From the result (a list of DataFrame objects) I want to return all rows whose cell value is an element of a given array.

So far I am struggling to access a single column value rather than the whole object.

Syntax of the table (the header lines are not extracted correctly, so this is the real output):

0           1              2       3
Date        Name           Number  Text
09.09.2022  Smith Jason    3290    Free Car Wash
12.03.2022  Betty Paulsen  231     10l Gasoline

import pandas as pd
import numpy as np

url = f'https://some_website.com'

df = pd.read_html(url)

arr_Nr = ['3290', '9273']

def correct_number():
    for el in df[0][1]:
        if (el in arr_Nr):
            return True

def get_winner():
    for el in df:
        if (el in arr_Nr):
            return el

print(get_winner())

With the Function

correct_number()

I can output that there is a winner, but I do not get the details when I call

get_winner()

EDIT

I think I am now one step closer: the function read_html() returns a list of DataFrame objects. In my example there is only one table, so accessing it via df = dfs[0] should give me the correct DataFrame object.

But when I try the following, the code doesn't work as expected: no filter is applied and the table is returned in full:

df2 = df[df.Number == '3290']
print(df2)

  • You need to set the first line as your header. You can find the answer here: https://stackoverflow.com/questions/31328861/python-pandas-replacing-header-with-top-row Then you can access each column this way: df.column_name – yasmine Dec 13 '22 at 16:21
  • Thanks for that, I was able to define the regular headers with some details in read_html: dfs = pd.read_html(url, header=0, flavor='bs4'). Now how can I access one column of a single DataFrame? – senior_freshman Dec 13 '22 at 17:23
  • You can access a column of the DataFrame using df.column_name. For example, if you want to access Date, use df.Date, and if you want to access a specific line in the column by its row index, you can do df.Date.loc[index] – yasmine Dec 13 '22 at 20:20

1 Answer

Okay, I finally figured it out:

pandas returns a list of DataFrame objects. In my example there is only one table, so I first had to pick that table (the DataFrame object) out of the list. Before I could compare the values, I had to convert them to integers: pandas seemed to extract them as strings, so they could not be compared properly with my array.
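To illustrate the type mismatch described above, here is a minimal sketch that uses a small in-memory table instead of a scraped page. The column names and rows come from the sample table in the question; storing Number as strings is an assumption made here to mirror what the scrape seemed to produce.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table. Number is stored as strings
# on purpose, to mimic the situation described in the answer.
df = pd.DataFrame({
    "Date": ["09.09.2022", "12.03.2022"],
    "Name": ["Smith Jason", "Betty Paulsen"],
    "Number": ["3290", "231"],
    "Text": ["Free Car Wash", "10l Gasoline"],
})

winner_nrs = [3290, 843]

# String cells never equal integer values, so nothing is selected.
no_match = df[df["Number"].isin(winner_nrs)]

# After casting the column to int, the membership test works.
match = df[df["Number"].astype(int).isin(winner_nrs)]

print(len(no_match))
print(match["Name"].tolist())
```

Checking df.dtypes on the real table is a quick way to see whether the column holds strings (dtype object) or integers before deciding on a cast.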

In the end the code looks far more elegant than I expected:

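import pandas as pd

url = 'https://mywebsite.com/winners-2022'

# read_html returns a list of DataFrames, one per table found on the page
dfs_list = pd.read_html(url, header=0, flavor='bs4')
df = dfs_list[0]  # the page contains only one table

winner_nrs = [3290, 843]

# keep only the rows whose Losnummer (cast to int) is one of the winning numbers
result = df[df.Losnummer.astype(int).isin(winner_nrs)]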
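One caveat with the astype(int) cast is that it raises a ValueError as soon as a cell is not a valid integer. A possible, more forgiving variant is sketched below with pd.to_numeric(..., errors="coerce"), which turns unparseable cells into NaN so they simply never match. The sample rows, including the "n/a" cell, are made up for illustration and are not from the real page.

```python
import pandas as pd

# Hypothetical data: the third row has a non-numeric cell that would make
# astype(int) fail.
df = pd.DataFrame({
    "Name": ["Smith Jason", "Betty Paulsen", "Unknown"],
    "Number": ["3290", "231", "n/a"],
})

winner_nrs = [3290, 843]

# Convert to numbers; cells that cannot be parsed become NaN.
numbers = pd.to_numeric(df["Number"], errors="coerce")

# NaN is not in winner_nrs, so the bad row is dropped by the filter.
result = df[numbers.isin(winner_nrs)]

print(result)
```

Whether silently skipping bad cells is acceptable depends on the data; if every cell is expected to be a number, the stricter astype(int) has the advantage of failing loudly.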