0

I am parsing semi-structured text documents (SEC filings) using BeautifulSoup. The table I am looking for looks like this:

<table id="c1217ce3e2ce4613a7595102fa855c49" style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; WIDTH: 100%; BORDER-COLLAPSE: collapse" cellspacing="0" cellpadding="0">
<tr>
<td style="WIDTH: 27.7%; VERTICAL-ALIGN: top; BORDER-BOTTOM: #000000 2px solid">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; FONT-WEIGHT: bold; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Name</div>
</td>
<td style="WIDTH: 2.52%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 6.23%; VERTICAL-ALIGN: top; BORDER-BOTTOM: #000000 2px solid">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; FONT-WEIGHT: bold; TEXT-ALIGN: center; LINE-HEIGHT: 12.55pt">Age</div>
</td>
<td style="WIDTH: 2.97%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 60.58%; VERTICAL-ALIGN: top; BORDER-BOTTOM: #000000 2px solid">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; FONT-WEIGHT: bold; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Position(s)</div>
</td>
</tr>

<tr>
<td style="BORDER-TOP: #000000 2px solid; WIDTH: 27.7%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Michael Reger</div>
</td>
<td style="WIDTH: 2.52%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">&#160;</td>
<td style="BORDER-TOP: #000000 2px solid; WIDTH: 6.23%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: center; LINE-HEIGHT: 12.55pt">40</div>
</td>
<td style="WIDTH: 2.97%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">&#160;</td>
<td style="BORDER-TOP: #000000 2px solid; WIDTH: 60.58%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Director and Chief Executive Officer</div>
</td>
</tr>

<tr>
<td style="WIDTH: 27.7%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Lisa Bromiley</div>
</td>
<td style="WIDTH: 2.52%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 6.23%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: center; LINE-HEIGHT: 12.55pt">43</div>
</td>
<td style="WIDTH: 2.97%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 60.58%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Director</div>
</td>
</tr>

<tr>
<td style="WIDTH: 27.7%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Robert Grabb</div>
</td>
<td style="WIDTH: 2.52%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">&#160;</td>
<td style="WIDTH: 6.23%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: center; LINE-HEIGHT: 12.55pt">64</div>
</td>
<td style="WIDTH: 2.97%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">&#160;</td>
<td style="WIDTH: 60.58%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Director</div>
</td>
</tr>

<tr>
<td style="WIDTH: 27.7%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Delos Cy Jamison</div>
</td>
<td style="WIDTH: 2.52%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 6.23%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: center; LINE-HEIGHT: 12.55pt">66</div>
</td>
<td style="WIDTH: 2.97%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 60.58%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Director</div>
</td>
</tr>

<tr>
<td style="WIDTH: 27.7%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Jack King</div>
</td>
<td style="WIDTH: 2.52%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">&#160;</td>
<td style="WIDTH: 6.23%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: center; LINE-HEIGHT: 12.55pt">63</div>
</td>
<td style="WIDTH: 2.97%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">&#160;</td>
<td style="WIDTH: 60.58%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Director</div>
</td>
</tr>

<tr>
<td style="WIDTH: 27.7%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Richard Weber</div>
</td>
<td style="WIDTH: 2.52%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 6.23%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: center; LINE-HEIGHT: 12.55pt">52</div>
</td>
<td style="WIDTH: 2.97%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 60.58%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Director and Chairman of the Board</div>
</td>
</tr>
</table>

I am using the following code to extract all of the document's tables into a variable:

from bs4 import BeautifulSoup

# Use a context manager so the file is closed once parsing is done
with open("/sec_gov/Archives/edgar/data/1104485/0001104485-16-000061.txt", 'r') as html:
    soup = BeautifulSoup(html, 'html.parser')

# One list per table, holding one list of cell texts per row
tables = [
    [
        [td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr')
    ]
    for table in soup.find_all('table')
]

print(tables)

Now I get an array with all the tables in the document:

[[['', '1)', 'Title of each class of securities to which transaction applies:', '']], [['', '2)', 'Aggregate number of securities to which transaction applies:', '']], [['', '3)', 'Per unit price or other underlying value of transaction computed pursuant to Exchange Act Rule 0-11 (set forth the amount on which the filing fee is calculated and state how it was determined):']], [['', '4)', 'Proposed maximum aggregate value of transaction:', '']], [['', '5)', 'Total fee paid:', '']], [['£', 'Check box if any part of the fee is offset as provided by Exchange Act Rule 0-11(a)(2) and identify the filing for which the offsetting fee was paid previously.\xa0 Identify the previous filing by registration statement number, or the form or schedule and the date of its filing.']], [['', '1)', 'Amount Previously Paid:', '']], [['', '2)', 'Form, Schedule or Registration Statement No.:', '']], [['', '3)', 'Filing Party:', '']], [['', '4)', 'Date Filed:', '']], [['1.', 'To elect six directors to serve until the Annual Meeting of Shareholders in 2017;']], [['2.', 'To ratify the appointment ofGrant Thornton LLPas our independent registered public accounting firm for the fiscal year ending December 31, 2016;']], [['3.', 'To approve an amendment to our Articles of Incorporation to increase the number of authorized shares of common stock;']], [['4.', 'To approve an amendment to add shares to our 2013 Incentive Plan; and']], [['5.', 'To approve, by a non-binding advisory vote, the compensation paid to our named executive officers.']], [['', 'Page'], ['THE ANNUAL MEETING', '1'], ['VOTING INSTRUCTIONS', '2'], ['CORPORATE GOVERNANCE', '4'], ['SECURITY OWNERSHIP OF CERTAIN BENEFICIAL OWNERS AND MANAGEMENT', '8'], ['SECTION 16(a) BENEFICIAL OWNERSHIP REPORTING COMPLIANCE', '10'], ['PROPOSAL 1: ELECTION OF DIRECTORS', '10'], ['PROPOSAL 2: RATIFICATION OF APPOINTMENT OF INDEPENDENT REGISTERED PUBLIC ACCOUNTANTS', '13'], ['AUDIT COMMITTEE REPORT', '16'], ['PROPOSAL 3:\xa0 APPROVE AN AMENDMENT TO 
OUR ARTICLES OF INCORPORATION TO INCREASE THE NUMBER OF AUTHORIZED SHARES OF COMMON STOCK', '17'], ['PROPOSAL 4: APPROVE AN AMENDMENT TO ADD SHARES TO THE 2013 INCENTIVE PLAN', '19'], ['PROPOSAL 5: NONBINDING ADVISORY VOTE TO APPROVE THE COMPENSATION OF THE NAMED EXECUTIVE OFFICERS', '29'], ['EXECUTIVE COMPENSATION', '30'], ['CERTAIN RELATIONSHIPS AND RELATED TRANSACTIONS', '54'], ['NORTHERN OIL AND GAS, INC. FORM 10-K', '55'], ['HOUSEHOLDING', '55'], ['SHAREHOLDER PROPOSALS FOR 2017 ANNUAL MEETING', '55'], ['OTHER MATTERS', '55']], [['§', 'by filing a written notice of revocation with our corporate secretary prior to commencement of the Annual Meeting;']], [['§', 'by submitting another proper proxy with a more recent date than that of the proxy first given by signing, dating and returning a proxy card to our company by mail; or']], [['§', 'by attending the Annual Meeting and voting in person.']], [['Name', '', 'Audit Committee', '', 'Compensation Committee', '', 'Nominating Committee', '', 'Independent Directors'], ['Lisa Bromiley', '', '✓*', '', '✓*', '', '', '', '✓'], ['Robert Grabb', '', '✓', '', '', '', '✓', '', '✓'], ['Delos Cy Jamison', '', '✓', '', '', '', '✓', '', '✓'], ['Jack King', '', '', '', '✓', '', '✓*', '', '✓'], ['Michael Reger', '', '', '', '', '', '', '', ''], ['Richard Weber', '', '', '', '✓', '', '', '', '✓+']], [['*', 'Denotes committee chairperson.']], [['+', 'Mr. 
Weber has served as chairman of the board of directors since January 2016.']], [['Name(1)', '', 'Number ofShares', '', '', 'Percent ofCommon Stock', ''], ['Certain Beneficial Owners:', '', '', '', '', '', ''], ['BlackRock, Inc.55 East 52ndStreet, New York, NY 10055', '', '', '5,498,238', '(2)', '', '', '8.6', '%'], ['Fine Capital Partners, L.P.590 Madison Avenue, 27thFloor, New York, NY 10022', '', '', '6,228,555', '(3)', '', '', '9.8', '%'], ['FMR LLC245 Summer Street, Boston, MA 02210', '', '', '6,074,233', '(4)', '', '', '9.5', '%'], ['TRT Holdings, Inc.4001 Maple Ave., Suite 600, Dallas, TX 75219', '', '', '12,461,885', '(5)', '', '', '19.6', '%'], ['The Vanguard Group100 Vanguard Blvd., Malvern, PA 19355', '', '', '4,332,562', '(6)', '', '', '6.8', '%'], ['Directors and Executive Officers:', '', '', '', '', '', '', '', ''], ['Michael Reger', '', '', '4,484,882', '(7)', '', '', '7.0', '%'], ['LisaBromiley', '', '', '105,972', '(8)', '', '', '*', ''], ['Robert Grabb', '', '', '138,675', '', '', '', '*', ''], ['Delos Cy Jamison', '', '', '38,937', '', '', '', '*', ''], ['Jack King', '', '', '134,486', '(9)', '', '', '*', ''], ['Richard Weber', '', '', '312,715', '(10)', '', '', '*', ''], ['Thomas Stoelk', '', '', '481,662', '', '', '', '*', ''], ['Brandon Elliott', '', '', '186,948', '', '', '', '*', ''], ['Erik Romslo', '', '', '219,931', '', '', '', '*', ''], ['Darrell Finneman (former Executive Officer)', '', '', '51,610', '', '', '', '*', ''], ['Directors and Current Executive Officers as a Group (9 persons)', '', '', '6,104,208', '(11)', '', '', '9.5', '%']], [['*', 'Denotes less than 1% ownership.']], [['(1)', 'As used in this table, "beneficial ownership" means the sole or shared power to vote, or to direct the voting of, a security, or the sole or shared investment power with respect to a security (i.e., the power to dispose of, or to direct the disposition of, a security).\xa0 The address of each member of management and each director is care of our 
company.']], [['(2)', 'The number of shares indicated is based on information reported to the SEC in an amended Schedule 13G filed by BlackRock, Inc. on January 27, 2016, and reflects beneficial ownership as of December 31, 2015.\xa0 BlackRock, Inc. has sole voting power with respect to 5,348,217 shares and sole dispositive power with respect to 5,498,238 shares.']], [['(3)', 'The number of shares indicated is based on information reported to the SEC in a Schedule 13G filed by Fine Capital Partners, L.P. on February 16, 2015, and reflects beneficial ownership as of December 31, 2015.\xa0 Fine Capital Partners, L.P., Fine Capital Advisors, LLC and Debra Fine have shared voting power with respect to 6,228,555 shares and shared dispositive power with respect to 6,228,555 shares.']], [['(4)', 'The number of shares indicated is based on information reported to the SEC in an amended Schedule 13G filed by FMR LLC on March 10, 2016, and reflects beneficial ownership as of February 29, 2016.\xa0 FMR LLC has no sole voting power and has sole dispositive power with respect to 6,074,233 shares.\xa0 Members of the Johnson family, including Abigail P. Johnson, Director, Vice Chairman, CEO and President of FMR LLC, are the predominant owners, directly or through trusts, of Series B voting common shares of FMR LLC, representing 49% of the voting power of FMR LLC.\xa0 The Johnson family group and all other Series B shareholders have entered into a shareholders\' voting agreement under which all Series B voting common shares will be voted in accordance with the majority vote of Series B voting common shares.\xa0 Accordingly, through their ownership of voting common shares and the execution of the shareholders\' voting agreement, members of the Johnson family may be deemed to form a controlling group with respect to FMR LLC.\xa0 Neither FMR LLC nor Abigail P. 
Johnson has the sole power to vote or direct the voting of the shares owned directly by various investment companies (the "Fidelity Funds") advised by Fidelity Management & Research Company ("FMR Co"), a wholly owned subsidiary of FMR LLC, which power resides with the Fidelity Funds\' Boards of Trustees.\xa0 FMR Co carries out the voting of the shares under written guidelines established by the Fidelity Funds\' Boards of Trustees.']], [['(5)', 'The information is based on information reported to the SEC in an Amended Schedule 13D filed by TRT Holdings, Inc., Cresta Investments, LLC, Cresta Greenwood, LLC and Robert B. Rowling (the "Reporting Persons") on November 28, 2014, as amended on February 24, 2016, as well as additional information reported to the SEC on a Form 4 filed on behalf of Robert B. Rowling on February 26, 2016.The Reporting Persons beneficially own, in the aggregate, 12,461,885 common shares.TRT Holdings, Inc. has sole voting power and sole dispositive power with respect to 7,169,741 shares.\xa0 Cresta Investments, LLC has sole voting power and sole dispositive power with respect to 3,947,921 shares.\xa0 Cresta Greenwood, LLC has sole voting power and sole dispositive power with respect to 1,344,223 shares.\xa0 Mr. Rowlingbeneficially owns all 12,461,885 common shares held directly by TRT Holdings, Inc., Cresta Investments, LLC and Cresta Greenwood, LLC.Mr. Rowlingbeneficially owns the common shares held directly by TRT Holdings, Inc. due to his ownership of all of the shares of Class B Common Stock of TRT Holdings, Inc.Mr. 
Rowlingbeneficially owns the common shares held directly by Cresta Investments, LLC and Cresta Greenwood, LLC due to his direct and indirect ownership of 100% of the ownership interests in such entities.']], [['(6)', 'Thenumberof shares indicated is based on information reported to the SEC in an amended Schedule 13G filed by The Vanguard Group on February 11, 2016, and reflects beneficial ownership as of December 31, 2015.\xa0 The Vanguard Group has sole voting power with respect to 74,199 shares, sole dispositive power with respect to 4,332,562 shares and shared dispositive power with respect to 69,499 shares.\xa0 Vanguard Fiduciary Trust Company ("VFTC"), a wholly-owned subsidiary of The Vanguard Group, Inc., is the beneficial owner of 69,499 shares as a result of its serving as investment manager of collective trust accounts.\xa0 Vanguard Investments Australia, Ltd. ("VIA"), a wholly-owned subsidiary of The Vanguard Group, Inc., is the beneficial owner of 4,700 shares as a result of its serving as investment manager of Australian investment offerings.']], [['(7)', "Includes 1,000 shares held by Mr. Reger's spouse."]], [['(8)', 'Includes 55,872 shares subject to options held by Ms.Bromiley.']], [['(9)', 'Includes 86,000 shares subject to options held by Mr. King.']], [['(10)', 'Includes 250,000 shares subject to options held by Mr. Weber.']], [['(11)', "Consists of all shares held by directors and current executive officers at March 31, 2016.\xa0 Includes 1,000 shares held by Mr. Reger's spouse, and an aggregate of 391,872 shares covered by options held by our directors."]], [['Name', '', 'Age', '', 'Position(s)'], ['Michael Reger', '', '40', '', 'Director and Chief Executive Officer'], ['Lisa Bromiley', '', '43', '', 'Director'], ['Robert Grabb', '', '64', '', 'Director'], ['Delos Cy Jamison', '', '66', '', 'Director'], ['Jack King', '', '63', '', 'Director'], ['Richard Weber', '', '52', '', 'Director and Chairman of the Board']]

As you can see, there are many tables. I am looking for a specific table that includes the columns "name" and "position." Specifically, I am trying to:

  • Loop through all the tables extracted from the document
  • Select the table that includes the columns "name" and "position"
  • Extract that table in another variable

How can I do this?

user1029296

3 Answers

0

You can use pandas to get it:

import pandas as pd

html= """<table id="c1217ce3e2ce4613a7595102fa855c49" style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; WIDTH: 100%; BORDER-COLLAPSE: collapse" cellspacing="0" cellpadding="0">
<tr>
<td style="WIDTH: 27.7%; VERTICAL-ALIGN: top; BORDER-BOTTOM: #000000 2px solid">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; FONT-WEIGHT: bold; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Name</div>
</td>
<td style="WIDTH: 2.52%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 6.23%; VERTICAL-ALIGN: top; BORDER-BOTTOM: #000000 2px solid">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; FONT-WEIGHT: bold; TEXT-ALIGN: center; LINE-HEIGHT: 12.55pt">Age</div>
</td>
<td style="WIDTH: 2.97%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 60.58%; VERTICAL-ALIGN: top; BORDER-BOTTOM: #000000 2px solid">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; FONT-WEIGHT: bold; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Position(s)</div>
</td>
</tr>

<tr>
<td style="BORDER-TOP: #000000 2px solid; WIDTH: 27.7%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Michael Reger</div>
</td>
<td style="WIDTH: 2.52%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">&#160;</td>
<td style="BORDER-TOP: #000000 2px solid; WIDTH: 6.23%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: center; LINE-HEIGHT: 12.55pt">40</div>
</td>
<td style="WIDTH: 2.97%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">&#160;</td>
<td style="BORDER-TOP: #000000 2px solid; WIDTH: 60.58%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Director and Chief Executive Officer</div>
</td>
</tr>

<tr>
<td style="WIDTH: 27.7%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Lisa Bromiley</div>
</td>
<td style="WIDTH: 2.52%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 6.23%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: center; LINE-HEIGHT: 12.55pt">43</div>
</td>
<td style="WIDTH: 2.97%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 60.58%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Director</div>
</td>
</tr>

<tr>
<td style="WIDTH: 27.7%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Robert Grabb</div>
</td>
<td style="WIDTH: 2.52%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">&#160;</td>
<td style="WIDTH: 6.23%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: center; LINE-HEIGHT: 12.55pt">64</div>
</td>
<td style="WIDTH: 2.97%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">&#160;</td>
<td style="WIDTH: 60.58%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Director</div>
</td>
</tr>

<tr>
<td style="WIDTH: 27.7%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Delos Cy Jamison</div>
</td>
<td style="WIDTH: 2.52%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 6.23%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: center; LINE-HEIGHT: 12.55pt">66</div>
</td>
<td style="WIDTH: 2.97%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 60.58%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Director</div>
</td>
</tr>

<tr>
<td style="WIDTH: 27.7%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Jack King</div>
</td>
<td style="WIDTH: 2.52%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">&#160;</td>
<td style="WIDTH: 6.23%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: center; LINE-HEIGHT: 12.55pt">63</div>
</td>
<td style="WIDTH: 2.97%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">&#160;</td>
<td style="WIDTH: 60.58%; VERTICAL-ALIGN: top; BACKGROUND-COLOR: #cbe9fd">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Director</div>
</td>
</tr>

<tr>
<td style="WIDTH: 27.7%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Richard Weber</div>
</td>
<td style="WIDTH: 2.52%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 6.23%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: center; LINE-HEIGHT: 12.55pt">52</div>
</td>
<td style="WIDTH: 2.97%; VERTICAL-ALIGN: top">&#160;</td>
<td style="WIDTH: 60.58%; VERTICAL-ALIGN: top">
<div style="FONT-SIZE: 10pt; FONT-FAMILY: 'Times New Roman', Times, serif; TEXT-ALIGN: justify; LINE-HEIGHT: 12.55pt">Director and Chairman of the Board</div>
</td>
</tr>
</table>"""

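from io import StringIO

# read_html returns a list of DataFrames, one per <table>; recent pandas versions
# expect a file-like object rather than a literal HTML string
dfs = pd.read_html(StringIO(html))
print(dfs)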

OUTPUT:

[                  0   1    2   3                                     4
0              Name NaN  Age NaN                           Position(s)
1     Michael Reger NaN   40 NaN  Director and Chief Executive Officer
2     Lisa Bromiley NaN   43 NaN                              Director
3      Robert Grabb NaN   64 NaN                              Director
4  Delos Cy Jamison NaN   66 NaN                              Director
5         Jack King NaN   63 NaN                              Director
6     Richard Weber NaN   52 NaN    Director and Chairman of the Board]

You can then inspect the table and pull out the data you need; see the pandas documentation for the available operations.
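As one possible follow-up, the first row of the parsed frame can be promoted to the header and the empty spacer columns dropped. The sketch below builds a small frame by hand in the same shape as the output above instead of parsing the filing again; the names raw and officers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the frame shown above: columns 1 and 3 are the empty spacer cells
raw = pd.DataFrame([
    ["Name", np.nan, "Age", np.nan, "Position(s)"],
    ["Michael Reger", np.nan, "40", np.nan, "Director and Chief Executive Officer"],
    ["Lisa Bromiley", np.nan, "43", np.nan, "Director"],
])

# Drop columns that are entirely empty, then use the first row as the header
officers = raw.dropna(axis=1, how="all")
officers.columns = officers.iloc[0]
officers = officers.iloc[1:].reset_index(drop=True)

# Keep only the two columns of interest
print(officers[["Name", "Position(s)"]])
```

This assumes the header labels sit in the first data row, as they do here because the filing uses <td> cells rather than <th> cells for its headings.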

EDIT:

You can wrap this snippet in a function and return the DataFrame instead of printing a message:

dfs = pd.read_html(html)
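for df in dfs:
    # Flatten every cell to a string so the header labels can be looked up
    values = set(df.astype(str).to_numpy().ravel())
    # Require both labels, otherwise tables with only a "Name" column would match too
    if 'Name' in values and 'Position(s)' in values:
        print("You found the table")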
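A minimal sketch of that function form is shown below. It assumes the header labels are in the first row of each frame and matches them case-insensitively by substring, so "Position(s)" satisfies "position". The function name find_table and the two hand-made frames are hypothetical stand-ins for the list that pd.read_html returns.

```python
import pandas as pd

def find_table(dfs, required=("name", "position")):
    """Return the first DataFrame whose first row mentions every required label, else None."""
    for df in dfs:
        if df.empty:
            continue
        # Lower-case the first row so "Name" and "Position(s)" match "name" and "position"
        header = [str(v).lower() for v in df.iloc[0]]
        if all(any(label in cell for cell in header) for label in required):
            return df
    return None

# Two small hand-made frames standing in for the parsed tables
dfs = [
    pd.DataFrame([["1.", "To elect six directors"]]),
    pd.DataFrame([["Name", "Age", "Position(s)"], ["Jack King", "63", "Director"]]),
]
match = find_table(dfs)
print(match)
```

Returning None when nothing matches lets the caller tell "no such table in this filing" apart from an error.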
Maaz
  • Thanks for your help. The problem is that BeautifulSoup extracts many tables from the document. I need to keep only the one that has columns for "name" and "position". How can I do that? – user1029296 Jul 29 '19 at 15:24
  • Sorry I was unavailable. Don't know if you found yet, but you can take a look at my edit ;-) – Maaz Jul 30 '19 at 06:54
0

To select the <table> that contains the columns Name and Position, you can use the CSS selector table:has(td:contains(Name)):has(td:contains(Position)):

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')   # data is the HTML snippet from the question

rows = []
for tr in soup.select('table:has(td:contains(Name)):has(td:contains(Position)) tr'):
    rows.append([td.get_text(strip=True) for td in tr.select('td') if td.get_text(strip=True)])

for row in rows:
    print(('{: <20}'*len(row)).format(*row))

Prints:

Name                Age                 Position(s)         
Michael Reger       40                  Director and Chief Executive Officer
Lisa Bromiley       43                  Director            
Robert Grabb        64                  Director            
Delos Cy Jamison    66                  Director            
Jack King           63                  Director            
Richard Weber       52                  Director and Chairman of the Board

The CSS selector table:has(td:contains(Name)):has(td:contains(Position)) means: select every <table> that has a <td> containing "Name" and a <td> containing "Position".
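To my knowledge, newer releases of soupsieve (the selector library behind BeautifulSoup's select) deprecate :contains in favor of :-soup-contains, so the same idea may need the spelling below. This is a sketch on a trimmed, hypothetical stand-in for the filing, not the full document, and it uses html.parser only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for the filing: one unrelated table and the target table
data = """
<table><tr><td>1.</td><td>To elect six directors</td></tr></table>
<table>
<tr><td><div>Name</div></td><td>&#160;</td><td><div>Age</div></td><td>&#160;</td><td><div>Position(s)</div></td></tr>
<tr><td><div>Jack King</div></td><td>&#160;</td><td><div>63</div></td><td>&#160;</td><td><div>Director</div></td></tr>
</table>
"""

soup = BeautifulSoup(data, "html.parser")

# select_one returns the first matching <table>, or None if nothing matches
table = soup.select_one(
    "table:has(td:-soup-contains('Name')):has(td:-soup-contains('Position'))"
)

rows = []
if table is not None:
    for tr in table.select("tr"):
        cells = [td.get_text(strip=True) for td in tr.select("td")]
        rows.append([c for c in cells if c])  # drop the empty spacer cells

print(rows)
```

Holding on to the matched table element first, rather than selecting its rows directly, makes it easy to check whether anything was found before extracting the cells.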

Further reading:

CSS Selector Reference

Andrej Kesely
  • Thank you so much for your help. I was thinking of using fuzzywuzzy to account for slight changes in titles (e.g. "name" vs. "names", etc). Do you think I can implement fuzzywuzzy somewhere in your suggestion? Thanks a lot. – user1029296 Jul 29 '19 at 15:45
  • @user1029296 I have no experience with `fuzzywuzzy` but you can try `soup.find_all(lambda tag: tag.name=='td' and call_to_fuzzywuzzy(tag.text) >= ?)` to select all `<td>` elements by some fuzzywuzzy score. – Andrej Kesely Jul 29 '19 at 15:51
  • Thank you so much! Would you mind telling me how I can store this table in a variable? Hopefully in pandas? – user1029296 Jul 29 '19 at 16:58
  • @user1029296 All your data is in variable `rows`, so something along the lines of `pd.DataFrame(rows).T.set_index(0).T` – Andrej Kesely Jul 29 '19 at 17:00
  • I tried but it seems to only store one column. Any suggestion? Ideally I would only keep the "name" and "position" columns – user1029296 Jul 29 '19 at 17:38
  • @user1029296 Try looking at these answers: https://stackoverflow.com/questions/19112398/getting-list-of-lists-into-pandas-dataframe – Andrej Kesely Jul 29 '19 at 17:42
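For the question raised in these comments, one way to turn a list of rows into a DataFrame and keep only the two columns of interest is sketched here. The rows list is a shortened copy of the printed output above, and the names df and officers are illustrative.

```python
import pandas as pd

# Shortened copy of the `rows` list printed above
rows = [
    ["Name", "Age", "Position(s)"],
    ["Michael Reger", "40", "Director and Chief Executive Officer"],
    ["Lisa Bromiley", "43", "Director"],
]

# The first inner list supplies the column names, the rest supply the data
df = pd.DataFrame(rows[1:], columns=rows[0])

# Keep only the two columns asked about in the comments
officers = df[["Name", "Position(s)"]]
print(officers)
```

One caveat: because empty cells are filtered out when rows is built, a row with a genuinely empty value would come out shorter than the header and its cells would shift left, so this assumes every data row is fully populated.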
0

This should find the first table that contains 'Name' and 'Position' in its columns. For each table, it builds a list of the header cells in the first row (assuming the table structure is as in your example), then stores the first table that has the required columns in the data_table variable.

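from bs4 import BeautifulSoup

soup = BeautifulSoup(html, features="html.parser")
data_table = None
for table in soup.find_all('table'):
    first_row = table.find('tr')
    if first_row is None:
        continue
    # Header cells of the first row, stripped of surrounding whitespace
    columns = [td.get_text(strip=True) for td in first_row.find_all('td')]
    # Substring match, because the header reads "Position(s)", not "Position"
    if any('Name' in c for c in columns) and any('Position' in c for c in columns):
        data_table = table
        break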
Richard
  • Hi and thank you for your help! I am trying to print the table but nothing comes out. Am I doing something wrong by just calling "print(table)" in the loop? – user1029296 Jul 29 '19 at 16:57
  • The table in the loop will be some kind of BeautifulSoup object. To print, you might need to convert it to a string. I haven't checked but print(str(soup)) should work. – Richard Jul 30 '19 at 08:55
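The three steps in the question (loop over the tables, pick the one with "name" and "position" columns, store it in another variable) can also be sketched directly on the nested list from the question, with some tolerance for variants such as "name" vs. "names". The sketch uses the standard library's difflib in place of fuzzywuzzy; the function names, the 0.8 threshold, and the shortened tables list are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    """True if two labels are close after lower-casing (the threshold is an arbitrary choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def find_table(tables, wanted=("name", "position"), threshold=0.8):
    """Return the first table whose first row has a cell close to every wanted label."""
    for table in tables:
        if not table:
            continue
        # Ignore the empty spacer cells in the header row
        header = [cell for cell in table[0] if cell]
        if all(any(similar(w, cell, threshold) for cell in header) for w in wanted):
            return table
    return None

# Shortened stand-in for the `tables` list printed in the question
tables = [
    [["1.", "To elect six directors to serve until the Annual Meeting of Shareholders in 2017;"]],
    [["Name", "", "Audit Committee", "", "Independent Directors"], ["Jack King", "", "", "", ""]],
    [["Name", "", "Age", "", "Position(s)"], ["Jack King", "", "63", "", "Director"]],
]

officers = find_table(tables)
print(officers)
```

A lower threshold accepts looser header variants but raises the chance of picking an unrelated table, so the value is worth checking against a few real filings.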