I'm trying to extract data from a table that lies between two headers in an HTML file using Python. In this case, the id I need to look up is in a span inside a header: I need the table under id="Perlis", which lies between the Perlis and Kedah headers:
<h2>
<span class="mw-headline" id="Perlis">Perlis</span>
<span class="mw-editsection">
<span class="mw-editsection-bracket">[</span>
<a href="/w/index.php?title=Results_of_the_2018_Malaysian_general_election_by_parliamentary_constituency&action=edit&section=3" title="Edit section: Perlis">edit</a>
<span class="mw-editsection-bracket">]</span>
</span>
</h2>
<table class="wikitable" style="text-align:center; font-size:90%; width:100%;">
<tbody>
<tr>
<th width="30"># </th>
<th width="150">Constituency s </th>
<th width="150">Winner </th>
<th width="80">Votes </th>
<th width="80">Majority </th>
<th width="150">Opponent(s) </th>
<th width="80">Votes </th>
<th width="150">Incumbent </th>
<th width="80">
<b>Incumbent Majority</b>
</th>
</tr>
<tr>
<td colspan="13">
<a href="/wiki/Barisan_Nasional" title="Barisan Nasional">BN</a>
<b>2</b> | <a href="/wiki/Gagasan_Sejahtera" title="Gagasan Sejahtera">GS</a>
<b>0</b> | <a href="/wiki/Pakatan_Harapan" title="Pakatan Harapan">PH</a>
<b>1</b> | <a href="/wiki/Independent_politician" title="Independent politician">Independent</a>
<b>0</b>
</td>
</tr>
<tr align="center">
<td rowspan="2">P1 </td>
<td rowspan="2">
<a href="/wiki/Padang_Besar_(federal_constituency)" title="Padang Besar (federal constituency)">Padang Besar</a>
</td>
<td rowspan="2" bgcolor="#B5BED9">
<a href="/wiki/Zahidi_Zainul_Abidin" title="Zahidi Zainul Abidin">Zahidi Zainul Abidin</a>
<br /> ( <b>BN</b>- <b>UMNO</b>)
</td>
<td rowspan="2">
<b>15,032</b>
</td>
<td rowspan="2">
<b>1,438</b>
</td>
<td bgcolor="#F18A8F">Izizam Ibrahim <br /> ( <b>PH</b>- <b>PPBM</b>) </td>
<td>
<b>13,594</b>
</td>
<td rowspan="2" bgcolor="#B5BED9">
<a href="/wiki/Zahidi_Zainul_Abidin" title="Zahidi Zainul Abidin">Zahidi Zainul Abidin</a>
<br /> ( <b>BN</b>- <b>UMNO</b>)
</td>
<td rowspan="2">
<b>7,426</b>
</td>
</tr>
<tr>
<td bgcolor="#B2DBB2">Mokhtar Senik <br /> ( <b>GS</b>- <b>PAS</b>) </td>
<td>
<b>7,874</b>
</td>
</tr>
<tr align="center">
<td rowspan="2">P2 </td>
<td rowspan="2">
<a href="/wiki/Kangar_(federal_constituency)" title="Kangar (federal constituency)">Kangar</a>
</td>
<td rowspan="2" bgcolor="#C7F2F2">Noor Amin Ahmad <br /> ( <b>PH</b>- <b>PKR</b>) </td>
<td rowspan="2">
<b>20,909</b>
</td>
<td rowspan="2">
<b>5,603</b>
</td>
<td bgcolor="#B5BED9">Ramli Shariff <br /> ( <b>BN</b>- <b>UMNO</b>) </td>
<td>
<b>15,306</b>
</td>
<td rowspan="2" bgcolor="#B5BED9">
<a href="/wiki/Shaharuddin_Ismail" title="Shaharuddin Ismail">Shaharuddin Ismail</a>
<br /> ( <b>BN</b>- <b>UMNO</b>)
</td>
<td rowspan="2">
<b>4,037</b>
</td>
</tr>
<tr>
<td bgcolor="#B2DBB2">Mohamad Zahid Ibrahim <br /> ( <b>GS</b>- <b>PAS</b>) </td>
<td>
<b>8,465</b>
</td>
</tr>
</tbody>
</table>
<h2>
<span class="mw-headline" id="Kedah">Kedah</span>
<span class="mw-editsection">
<span class="mw-editsection-bracket">[</span>
<a href="/w/index.php?title=Results_of_the_2018_Malaysian_general_election_by_parliamentary_constituency&action=edit&section=4" title="Edit section: Kedah">edit</a>
<span class="mw-editsection-bracket">]</span>
</span>
</h2>
<table class="wikitable" style="text-align:center; font-size:90%; width:100%;"></table>
This is the resulting JSON that I am trying to construct:
[
  {
    "state": "Perlis",
    "constituencies": [
      {
        "id": "P1",
        "name": "Padang Besar"
      },
      {
        "id": "P2",
        "name": "Kangar"
      }
    ]
  }
]
I'd like to know how to reference that specific table so I can extract its data into JSON. I have used Scrapy before, but I'm not sure how to do it in this case. This is what I had in mind:
import scrapy


class PostSpider(scrapy.Spider):
    name = 'manual_spider'
    start_urls = [
        '%URL%'
    ]

    def parse(self, response):
        # This is the bit I need: selecting the table that follows the "Perlis" header
        doc = response.xpath('//comment()').getall()
        # code continues here
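
For reference, here is a minimal sketch of one way the table between the two headers could be picked out. It deliberately uses only the standard library's html.parser instead of Scrapy selectors so that it runs on its own. The names StateTableParser and extract_state, the trimmed sample markup, and the P99 row are all hypothetical. It also assumes that the headline id doubles as the state name and that constituency rows start with a code like P1. In Scrapy, an XPath along the lines of //h2[span[@id="Perlis"]]/following-sibling::table[1] should select the same table, though that is not tested here.

```python
import json
import re
from html.parser import HTMLParser


class StateTableParser(HTMLParser):
    """Collect (id, name) pairs from the first table after a given headline id."""

    def __init__(self, state_id):
        super().__init__()
        self.state_id = state_id
        self.in_section = False  # True between the wanted headline and its table's end
        self.done = False        # True once the wanted table has been closed
        self.cells = None        # cell texts of the current row, or None outside a row
        self.in_cell = False
        self.constituencies = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "mw-headline":
            # Start collecting at the wanted headline; any other headline stops it.
            self.in_section = attrs.get("id") == self.state_id and not self.done
        elif self.in_section and tag == "tr":
            self.cells = []
        elif self.in_section and tag in ("td", "th") and self.cells is not None:
            self.cells.append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr" and self.cells is not None:
            cells = [" ".join(c.split()) for c in self.cells]
            # Assumption: constituency rows start with a code such as "P1";
            # header, summary, and opponent rows do not, so they are skipped.
            if len(cells) >= 2 and re.fullmatch(r"P\d+", cells[0]):
                self.constituencies.append({"id": cells[0], "name": cells[1]})
            self.cells = None
        elif tag == "table" and self.in_section:
            self.in_section = False
            self.done = True

    def handle_data(self, data):
        if self.in_cell and self.cells:
            self.cells[-1] += data


def extract_state(html, state_id):
    """Return {"state": ..., "constituencies": [...]} for one headline id."""
    parser = StateTableParser(state_id)
    parser.feed(html)
    parser.close()
    # Assumption: the headline id is also the display name of the state.
    return {"state": state_id, "constituencies": parser.constituencies}


# Trimmed, hypothetical stand-in for the markup in the question.
sample = """
<h2><span class="mw-headline" id="Perlis">Perlis</span></h2>
<table class="wikitable"><tbody>
<tr><th># </th><th>Constituency </th></tr>
<tr><td rowspan="2">P1 </td><td rowspan="2"><a href="#">Padang Besar</a></td></tr>
<tr><td>Mokhtar Senik</td><td><b>7,874</b></td></tr>
<tr><td rowspan="2">P2 </td><td rowspan="2"><a href="#">Kangar</a></td></tr>
</tbody></table>
<h2><span class="mw-headline" id="Kedah">Kedah</span></h2>
<table class="wikitable"><tr><td>P99 </td><td>Elsewhere</td></tr></table>
"""

result = [extract_state(sample, "Perlis")]
print(json.dumps(result, indent=2))
```

Inside the spider's parse method, the same helper could presumably be fed response.text and its return value yielded as the item, but I have not confirmed that against the real page.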