1

When I run the code below, the printed DataFrame looks wrong: most of the columns appear to be missing, even though I can see them in the HTML file.

import pandas as pd
from bs4 import BeautifulSoup
import lxml.html as lh

with open("htmltabletest.html", encoding="utf-8") as f:
    data = f.read()
    soup = BeautifulSoup(data, 'lxml')

    dfs = pd.read_html(soup.prettify())
    for df in dfs:
        print(df)

This prints:

   Unnamed: 0           ...                      Price  range
0         NaN           ...            $134.50  to  $2,222.50
1         NaN           ...             $20.39  to  $3,602.50

[2 rows x 5 columns]

The contents of htmltabletest.html are:

<table class="dataTable st-alternateRows" id="eventSearchTable">
<thead>
<tr>
<th id="th-es-rb"><div class="dt-th"> </div></th>
<th id="th-es-ed"><div class="dt-th"><span class="th-divider"> </span>Event date<br/>Time (local)</div></th>
<th id="th-es-en"><div class="dt-th"><span class="th-divider"> </span>Event name<br/>Venue</div></th>
<th id="th-es-ti"><div class="dt-th"><span class="th-divider"> </span>Tickets<br/>listed</div></th>
<th id="th-es-pr"><div class="dt-th es-lastCell"><span class="th-divider"> </span>Price<br/>range</div></th>
</tr>
</thead>
<tbody class="" id="eventSearchTbody"><tr class="even" id="r-se-103577924">
<td class="nowrap"><input class="es-selectedEvent" id="se-103577924-check" name="selectEvent" type="radio"/></td>
<td class="nowrap" id="se-103577924-eventDateTime">Thu, 10/11/2018<br/>8:20 p.m.</td>
<td><div><a class="ellip" href="services/priceanalysis?eventId=103577924&amp;sectionId=0" id="se-103577924-eventName" target="_blank">Philadelphia Eagles at New York Giants</a></div><div id="se-103577924-venue">MetLife Stadium, East Rutherford, NJ</div></td>
<td id="se-103577924-nrTickets">6655</td>
<td class="es-lastCell nowrap" id="se-103577924-priceRange"><span id="se-103577924-minPrice">$134.50</span>  to<br/><span id="se-103577924-maxPrice">$2,222.50</span></td>
</tr><tr class="odd" id="r-se-103577925">
<td class="nowrap"><input class="es-selectedEvent" id="se-103577925-check" name="selectEvent" type="radio"/></td>
<td class="nowrap" id="se-103577925-eventDateTime">Thu, 10/11/2018<br/>8:21 p.m.</td>
<td><div><a class="ellip" href="services/priceanalysis?eventId=103577925&amp;sectionId=0" id="se-103577925-eventName" target="_blank">PARKING PASSES ONLY Philadelphia Eagles at New York Giants</a></div><div id="se-103577925-venue">MetLife Stadium Parking Lots, East Rutherford, NJ</div></td>
<td id="se-103577925-nrTickets">929</td>
<td class="es-lastCell nowrap" id="se-103577925-priceRange"><span id="se-103577925-minPrice">$20.39</span>  to<br/><span id="se-103577925-maxPrice">$3,602.50</span></td>
</tr></tbody>
</table>
DYZ
Sparkflight

3 Answers

0

I ran your code and the printed output is fine. If you are working in a Jupyter or IPython session, you could also try display(df).

bowei
0
<tr>
<th id="th-es-rb"><div class="dt-th"> </div></th>
<th id="th-es-ed"><div class="dt-th"><span class="th-divider"> </span>Event date<br/>Time (local)</div></th>
<th id="th-es-en"><div class="dt-th"><span class="th-divider"> </span>Event name<br/>Venue</div></th>
<th id="th-es-ti"><div class="dt-th"><span class="th-divider"> </span>Tickets<br/>listed</div></th>
<th id="th-es-pr"><div class="dt-th es-lastCell"><span class="th-divider"> </span>Price<br/>range</div></th>
</tr>

Your program works fine. Note that this header cell:

<th id="th-es-rb"><div class="dt-th"> </div></th>

contains no text, so pandas has no name to give that column and labels it Unnamed: 0. If you change your input to, for example,

<th id="th-es-rb"><div class="dt-th"> My new column </div></th>

the column picks up that text as its name.

MY OUTPUT:

In [146]: df.columns

Out[146]: 
Index(['My new cole', 'Event date  Time (local)', 'Event name  Venue',
       'Tickets  listed', 'Price  range'],
      dtype='object')

In [145]: df

Out[145]: 
   My new cole    Event date  Time (local)  \
0          NaN  Thu, 10/11/2018  8:20 p.m.   
1          NaN  Thu, 10/11/2018  8:21 p.m.   
                                   Event name  Venue  Tickets  listed  \
0  Philadelphia Eagles at New York Giants  MetLif...             6655   
1  PARKING PASSES ONLY Philadelphia Eagles at New...              929   
             Price  range  
0  $134.50  to  $2,222.50  
1   $20.39  to  $3,602.50  
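
If editing the HTML is not an option, the placeholder column can also be handled after parsing. A minimal sketch, assuming the parsed frame has the Unnamed: 0 column shown in the question; the small DataFrame below is a hypothetical stand-in for the result of pd.read_html, and the name "Selected" is made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by pd.read_html:
# the first column comes from the header cell with no text.
df = pd.DataFrame({
    "Unnamed: 0": [float("nan"), float("nan")],
    "Tickets  listed": [6655, 929],
    "Price  range": ["$134.50  to  $2,222.50", "$20.39  to  $3,602.50"],
})

# Option 1: drop every column that holds no data at all
# (here, the radio-button column, which is all NaN).
cleaned = df.dropna(axis="columns", how="all")

# Option 2: keep the column but give it a meaningful name
# ("Selected" is an illustrative choice, not from the original page).
renamed = df.rename(columns={"Unnamed: 0": "Selected"})

print(list(cleaned.columns))
print(list(renamed.columns))
```

Neither option changes the other columns, so the event data is untouched.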
fuwiak
  • Then why does my output not include the rest of the columns and data? There should be 3 more columns, for example `Event name`, with the event name in the first row. It shouldn't be printing NaN. – Sparkflight Aug 25 '18 at 22:49
0

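The answer in my case was that I am running the program in IDLE rather than PyCharm or another IDE. By default, pandas truncates printed output when the frame is wider than the display and replaces the middle columns with `...`, so the columns were parsed correctly but not shown. This has already been answered here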
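
A minimal sketch of widening the printed output, assuming pandas 1.0 or later (needed for `None` as `display.max_colwidth`). The DataFrame is a hypothetical stand-in for the parsed table, and the narrow settings in the first block are chosen only to reproduce the truncation:

```python
import pandas as pd

# Hypothetical stand-in for the table parsed by pd.read_html.
df = pd.DataFrame({
    "Unnamed: 0": [float("nan"), float("nan")],
    "Event date  Time (local)": ["Thu, 10/11/2018  8:20 p.m.",
                                 "Thu, 10/11/2018  8:21 p.m."],
    "Event name  Venue": [
        "Philadelphia Eagles at New York Giants  MetLife Stadium, East Rutherford, NJ",
        "PARKING PASSES ONLY Philadelphia Eagles at New York Giants  MetLife Stadium Parking Lots, East Rutherford, NJ",
    ],
    "Tickets  listed": [6655, 929],
    "Price  range": ["$134.50  to  $2,222.50", "$20.39  to  $3,602.50"],
})

# Narrow settings: pandas elides the middle columns with "...".
with pd.option_context("display.max_columns", 3, "display.width", 80):
    truncated = repr(df)

# Lift the limits for this block only: show every column, allow a wide
# line, and do not shorten long cell values.
with pd.option_context("display.max_columns", None,
                       "display.width", 1000,
                       "display.max_colwidth", None):
    full = repr(df)

print(truncated)
print(full)
```

To change the settings for the whole session instead, the same option names can be passed to pd.set_option, for example pd.set_option("display.max_columns", None).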

Sparkflight