
I'm trying to extract some data from a webpage using Python and Scrapy. I don't know enough HTML/CSS to tell whether the page is well formatted, but it doesn't appear to be. The information I'm interested in follows the pattern shown below. A table contains a set of entries (Name, Year, Int1, Int2) that I want to extract, but they are not in standard TD tags; instead, each value sits in a DIV tag inside a TD. Here's an example:

<table width='100%'>
<tr>
<td width='50%'>
<div style='width: 10px; float: left'>&nbsp;</div>
     <div style='width: 232px; float: left'>Mr. Richard D. Hanson</div>
     <div style='width: 40px; float: left'>1989</div>
     <div style='width: 88px; float: left; text-align: right'>1</div>
     <div style='width: 88px; float: left; text-align: right'>27</div></td><td width='50%'><div style='width: 10px; float: left'>&nbsp;</div>
     <div style='width: 232px; float: left'>Alison G. Mills, CPA</div>
     <div style='width: 40px; float: left'>1989</div>
     <div style='width: 88px; float: left; text-align: right'>8</div>
     <div style='width: 88px; float: left; text-align: right'>12</div></td></tr><tr><td width='50%'><div style='width: 10px; float: left'>&nbsp;</div>
     <div style='width: 232px; float: left'>Mr. Timothy D. Harrell</div>
     <div style='width: 40px; float: left'>1989</div>
     <div style='width: 88px; float: left; text-align: right'>28</div>
     <div style='width: 88px; float: left; text-align: right'>28</div></td><td width='50%'><div style='width: 10px; float: left'>&nbsp;</div>
     <div style='width: 232px; float: left'>Debora R. Mitchell, PhD</div>
     <div style='width: 40px; float: left'>1989</div>
     <div style='width: 88px; float: left; text-align: right'>20</div>
     <div style='width: 88px; float: left; text-align: right'>21</div></td></tr><tr><td width='50%'><div style='width: 10px; float: left'>&nbsp;</div>
<div style='width: 232px; float: left'>Mr. Tim J. Scoggins</div>
     <div style='width: 40px; float: left'>1989</div>
     <div style='width: 88px; float: left; text-align: right'>1</div>
     <div style='width: 88px; float: left; text-align: right'>9</div>
</td>
</tr>
</table>

Here's what I have tried so far using the Scrapy shell.

Attempt 1:

This works, but then I need to correlate the entries, i.e. get the Year, Int1 and Int2 for each Name returned below:

>>> response.xpath('//div[@style="width: 232px; float: left"]/text()').extract()
[u'Mr. Richard D. Hanson', u'Alison G. Mills, CPA', u'Mr. Timothy D. Harrell', u'Debora R. Mitchell, PhD', u'Mr. Tim J. Scoggins']

Attempt 2: Here I was hoping to make one call and then iterate over each entry, storing it in a dictionary. Unfortunately, I'm not sure what's happening here:

>>> response.xpath('//table[@width="100%"]/tr/td[@width="50%"]/div[@style="width: 10px; float: left"]/text()').extract()
[u'\xa0', u'\xa0', u'\xa0', u'\xa0', u'\xa0']

Any ideas?

7hacker

1 Answer


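You can get the text of every inner div and then split the extracted list into chunks of five, one chunk per person. The first item in each chunk is the u'\xa0' from the 10px spacer div, which is why the slice starts at x+1: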

In [1]: data = response.xpath("//table/tr/td/div/text()").extract() 
In [2]: [data[x+1:x+5] for x in xrange(0, len(data), 5)]
Out[2]: 
[[u'Mr. Richard D. Hanson', u'1989', u'1', u'27'],
 [u'Alison G. Mills, CPA', u'1989', u'8', u'12'],
 [u'Mr. Timothy D. Harrell', u'1989', u'28', u'28'],
 [u'Debora R. Mitchell, PhD', u'1989', u'20', u'21'],
 [u'Mr. Tim J. Scoggins', u'1989', u'1', u'9']]
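
If you want dictionaries rather than bare lists, the same chunking idea can be wrapped in a small helper. This is only a minimal sketch: the function name `rows_from_texts` and the key names in `FIELDS` are made up for illustration (the keys follow the Name, Year, Int1, Int2 labels from the question), it uses Python 3's `range` instead of `xrange`, and it assumes every person contributes exactly five div texts with the spacer first.

```python
# Hypothetical key names, based on the labels used in the question.
FIELDS = ("name", "year", "int1", "int2")


def rows_from_texts(texts, group_size=5):
    """Group a flat list of div texts into one dict per person.

    Assumes each person contributes `group_size` texts and that the
    first text of every group is the spacer div (u'\xa0').
    """
    records = []
    for start in range(0, len(texts), group_size):
        group = texts[start:start + group_size]
        # Skip the leading spacer and tidy up the remaining values.
        values = [text.strip() for text in group[1:]]
        if len(values) != len(FIELDS):
            # Incomplete trailing group: skip it rather than guess.
            continue
        records.append(dict(zip(FIELDS, values)))
    return records


# Sample input shaped like the extracted list from the example table.
sample = [
    "\xa0", "Mr. Richard D. Hanson", "1989", "1", "27",
    "\xa0", "Alison G. Mills, CPA", "1989", "8", "12",
]
records = rows_from_texts(sample)
print(records)
```

In the Scrapy shell you would pass the extracted list instead of `sample`, e.g. `rows_from_texts(response.xpath("//table/tr/td/div/text()").extract())`. Note that the values stay strings; convert the year and the two integers with `int()` if you need numbers.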
alecxe
  • Nice! I can see this works for the example here, but this is the page I'm trying to extract from: http://legacy.gtalumni.org/roll_call_donors?page=100 . There, though, I get an output of selector xpaths. Any idea how I can dive into it further? – 7hacker Sep 19 '16 at 18:14
  • @nthacker I think the `data` in your case should be `data = response.xpath("//div[@id='ContentMiddle']/table[2]/tr/td/div/text()").extract()`. Please test. – alecxe Sep 19 '16 at 18:17
  • Yup fantastic! Thanks! – 7hacker Sep 19 '16 at 18:22