1

I extract the data from a webpage but would like to arrange it into the pandas dataframe table.

finviz = requests.get('https://finviz.com/screener.ashx?v=152&o=ticker&c=0,1,2,3,4,5,6,7,10,11,12,14,16,17,19,21,22,23,24,25,31,32,33,38,41,48,65,66,67&r=1')
finz = html.fromstring(finviz.content)
col = finz.xpath('//table/tr/td[@class="table-top"]/text()')
data = finz.xpath('//table/tr/td/a[@class="screener-link"]/text()')

Col is the column for the pandas dataframe and each of the 28 data points in data list will be arranged accordingly into rows. data points 29 to 56 in the second row and so forth. How to write the code elegantly?

datalist = []
for y in range (28):
       datalist.append(data[y])
>>> datalist
['1', 'Agilent Technologies, Inc.', 'Healthcare', 'Medical Laboratories & Research', 'USA', '23.00B', '29.27', '4.39', '4.53', '18.76', '1.02%', '5.00%', '5.70%', '3
24.30M', '308.52M', '2.07', '8.30%', '15.70%', '14.60%', '1.09', '1,775,149', '2', 'Alcoa Corporation', 'Basic Materials', 'Aluminum', 'USA', '1.21B', '-']

But the result is not in table form like dataframe

bkcollection
  • 913
  • 1
  • 12
  • 35

1 Answers1

0

Pandas has a function to parse HTML: pd.read_html

You can try the following:

# Modules
import pandas as pd
import requests

# HTML content
finviz = requests.get('https://finviz.com/screener.ashx?v=152&o=ticker&c=0,1,2,3,4,5,6,7,10,11,12,14,16,17,19,21,22,23,24,25,31,32,33,38,41,48,65,66,67&r=1')

# Convert to dataframe
df = pd.read_html(finviz.content)[-2]

# Set 1st row to columns names
df.columns = df.iloc[0]
# Drop 1st row
df = df.drop(df.index[0])

# df = df.set_index('No.')
print(df)
# 0  No. Ticker                                     Company            Sector                               Industry      Country  ... Debt/Eq Profit M  Beta   Price  Change    Volume
# 1    1      A                  Agilent Technologies, Inc.        Healthcare        Medical Laboratories & Research          USA  ...    0.51   14.60 % 1.20   72.47 - 0.28 % 177333
# 2    2     AA                           Alcoa Corporation   Basic Materials                               Aluminum          USA  ...    0.44 - 10.80 % 2.03    6.28   3.46 % 3021371
# 3    3   AAAU                Perth Mint Physical Gold ETF         Financial                   Exchange Traded Fund          USA  ... - - - 16.08 - 0.99 % 45991
# 4    4   AACG                       ATA Creativity Global          Services          Education & Training Services        China  ...    0.02 - 2.96    0.95 - 0.26 % 6177
# 5    5   AADR         AdvisorShares Dorsey Wright ADR ETF         Financial                   Exchange Traded Fund          USA  ... - - - 40.80   0.22 % 1605
# 6    6    AAL                American Airlines Group Inc.          Services                         Major Airlines          USA  ... - 3.70 % 1.83   12.81   4.57 % 16736506
# 7    7   AAMC     Altisource Asset Management Corporation         Financial                       Asset Management          USA  ... - -17.90 % 0.78   12.28   0.00 % 0
# 8    8   AAME               Atlantic American Corporation         Financial                         Life Insurance          USA  ...    0.28 - 0.40 % 0.29    2.20   3.29 % 26
# 9    9    AAN                               Aaron's, Inc.          Services              Rental & Leasing Services          USA  ...    0.20    0.80 % 1.23   22.47 - 0.35 % 166203
# 10  10   AAOI               Applied Optoelectronics, Inc.        Technology    Semiconductor - Integrated Circuits          USA  ...    0.49 - 34.60 % 2.02    7.80   2.63 % 61303
# 11  11   AAON                                  AAON, Inc.  Industrial Goods             General Building Materials          USA  ...    0.02   11.40 % 0.88   48.60   0.71 % 20533
# 12  12    AAP                    Advance Auto Parts, Inc.          Services                      Auto Parts Stores          USA  ...    0.21    5.00 % 1.04   95.94 - 0.58 % 165445
# 13  13   AAPL                                  Apple Inc.    Consumer Goods                   Electronic Equipment          USA  ...    1.22   21.50 % 1.19  262.39   2.97 % 11236642
# 14  14    AAT                 American Assets Trust, Inc.         Financial                          REIT - Retail          USA  ...    1.03   12.50 % 0.99   25.35   2.78 % 30158
# 15  15    AAU                       Almaden Minerals Ltd.   Basic Materials                                   Gold       Canada  ...    0.04 - 0.53    0.28 - 1.43 % 34671
# 16  16   AAWW          Atlas Air Worldwide Holdings, Inc.          Services                    Air Services, Other          USA  ...    1.33 - 10.70 % 1.65   22.79   2.70 % 56521
# 17  17   AAXJ  iShares MSCI All Country Asia ex Japan ETF         Financial                   Exchange Traded Fund          USA  ... - - - 60.13   1.18 % 161684
# 18  18   AAXN                       Axon Enterprise, Inc.  Industrial Goods  Aerospace/Defense Products & Services          USA  ...    0.00    0.20 % 0.77   71.11   2.37 % 187899
# 19  19     AB              AllianceBernstein Holding L.P.         Financial                       Asset Management          USA  ...    0.00   89.60 % 1.35   19.15   1.84 % 54588
# 20  20    ABB                                     ABB Ltd  Industrial Goods                  Diversified Machinery  Switzerland  ...    0.67    5.10 % 1.10   17.44   0.52 % 723739

# [20 rows x 29 columns]

I let you improve the data selection if the HTML page structure change ! The parent div id might be useful.


Explanation "[-2]": the read_html returns a list of dataframe:

list_df = pd.read_html(finviz.content)
print(type(list_df))
# <class 'list'>

# Elements types in the lists
print(type(list_df [0]))
# <class 'pandas.core.frame.DataFrame' >

So in order to get the desired dataframe, I select the 2nd element before the end with [-2]. This discussion explains about negative indexes.

Alexandre B.
  • 5,387
  • 2
  • 17
  • 40