I'm trying to extract data from a table that lies between two headers in an HTML file using Python. In this case, the id to look up is on a span inside a header (I need the table that follows id="Perlis", which lies between the Perlis and Kedah headers):

 <h2>
    <span class="mw-headline" id="Perlis">Perlis</span>
    <span class="mw-editsection">
      <span class="mw-editsection-bracket">[</span>
      <a href="/w/index.php?title=Results_of_the_2018_Malaysian_general_election_by_parliamentary_constituency&amp;action=edit&amp;section=3" title="Edit section: Perlis">edit</a>
      <span class="mw-editsection-bracket">]</span>
    </span>
  </h2>
  <table class="wikitable" style="text-align:center; font-size:90%; width:100%;">
    <tbody>
      <tr>
        <th width="30"># </th>
        <th width="150">Constituency s </th>
        <th width="150">Winner </th>
        <th width="80">Votes </th>
        <th width="80">Majority </th>
        <th width="150">Opponent(s) </th>
        <th width="80">Votes </th>
        <th width="150">Incumbent </th>
        <th width="80">
          <b>Incumbent Majority</b>
        </th>
      </tr>
      <tr>
        <td colspan="13">
          <a href="/wiki/Barisan_Nasional" title="Barisan Nasional">BN</a>
          <b>2</b> | <a href="/wiki/Gagasan_Sejahtera" title="Gagasan Sejahtera">GS</a>
          <b>0</b> | <a href="/wiki/Pakatan_Harapan" title="Pakatan Harapan">PH</a>
          <b>1</b> | <a href="/wiki/Independent_politician" title="Independent politician">Independent</a>
          <b>0</b>
        </td>
      </tr>
      <tr align="center">
        <td rowspan="2">P1 </td>
        <td rowspan="2">
          <a href="/wiki/Padang_Besar_(federal_constituency)" title="Padang Besar (federal constituency)">Padang Besar</a>
        </td>
        <td rowspan="2" bgcolor="#B5BED9">
          <a href="/wiki/Zahidi_Zainul_Abidin" title="Zahidi Zainul Abidin">Zahidi Zainul Abidin</a>
          <br /> ( <b>BN</b>- <b>UMNO</b>)
        </td>
        <td rowspan="2">
          <b>15,032</b>
        </td>
        <td rowspan="2">
          <b>1,438</b>
        </td>
        <td bgcolor="#F18A8F">Izizam Ibrahim <br /> ( <b>PH</b>- <b>PPBM</b>) </td>
        <td>
          <b>13,594</b>
        </td>
        <td rowspan="2" bgcolor="#B5BED9">
          <a href="/wiki/Zahidi_Zainul_Abidin" title="Zahidi Zainul Abidin">Zahidi Zainul Abidin</a>
          <br /> ( <b>BN</b>- <b>UMNO</b>)
        </td>
        <td rowspan="2">
          <b>7,426</b>
        </td>
      </tr>
      <tr>
        <td bgcolor="#B2DBB2">Mokhtar Senik <br /> ( <b>GS</b>- <b>PAS</b>) </td>
        <td>
          <b>7,874</b>
        </td>
      </tr>
      <tr align="center">
        <td rowspan="2">P2 </td>
        <td rowspan="2">
          <a href="/wiki/Kangar_(federal_constituency)" title="Kangar (federal constituency)">Kangar</a>
        </td>
        <td rowspan="2" bgcolor="#C7F2F2">Noor Amin Ahmad <br /> ( <b>PH</b>- <b>PKR</b>) </td>
        <td rowspan="2">
          <b>20,909</b>
        </td>
        <td rowspan="2">
          <b>5,603</b>
        </td>
        <td bgcolor="#B5BED9">Ramli Shariff <br /> ( <b>BN</b>- <b>UMNO</b>) </td>
        <td>
          <b>15,306</b>
        </td>
        <td rowspan="2" bgcolor="#B5BED9">
          <a href="/wiki/Shaharuddin_Ismail" title="Shaharuddin Ismail">Shaharuddin Ismail</a>
          <br /> ( <b>BN</b>- <b>UMNO</b>)
        </td>
        <td rowspan="2">
          <b>4,037</b>
        </td>
      </tr>
      <tr>
        <td bgcolor="#B2DBB2">Mohamad Zahid Ibrahim <br /> ( <b>GS</b>- <b>PAS</b>) </td>
        <td>
          <b>8,465</b>
        </td>
      </tr>
    </tbody>
  </table>
  <h2>
    <span class="mw-headline" id="Kedah">Kedah</span>
    <span class="mw-editsection">
      <span class="mw-editsection-bracket">[</span>
      <a href="/w/index.php?title=Results_of_the_2018_Malaysian_general_election_by_parliamentary_constituency&amp;action=edit&amp;section=4" title="Edit section: Kedah">edit</a>
      <span class="mw-editsection-bracket">]</span>
    </span>
  </h2>
  <table class="wikitable" style="text-align:center; font-size:90%; width:100%;"></table>

This is the resulting JSON that I am trying to construct:

[
  {
    "state": "Perlis",
    "constituencies": [
      {
        "id": "P1",
        "name": "Padang Besar"
      },
      {
        "id": "P2",
        "name": "Kangar"
      }
    ]
  }
]

I'd like to know how to reference that specific table so I can extract the data into JSON. I have used Scrapy before, but I'm not sure how to do it in this case. This is what I had in mind:

import scrapy


class PostSpider(scrapy.Spider):

    name = 'manual_spider'

    start_urls = [
        '%URL%'
    ]

    def parse(self, response):
        # This is the bit I need: selecting the table under the "Perlis" header
        doc = response.xpath('//comment()').getall()

        # code continues here
Janez Kuhar
clattenburg cake
  • How about simply selecting the first table after the **Perlis** title? Constructing an XPath query that guarantees a certain element lies *between* two elements can be [cumbersome](https://stackoverflow.com/q/10859703/6367213) (dare I say not possible in your case?)! – Janez Kuhar Sep 06 '21 at 18:04
  • This question is similar as well: [XPATH substring before and after to return text between two html tags](https://stackoverflow.com/q/69084889/6367213) – Janez Kuhar Sep 07 '21 at 10:50
