I am trying to extract content from the Stanford website using Scrapy and XPath. The following line gets me what I want:
response.xpath('//h2[@class="schoolName"]/following-sibling::ul//text()').getall()
However, the output of the list is as follows:
[' \n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tAccounting (ACCT)\n\t\t\t\t\t',
'\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tAction Learning Programs (ALP)\n\t\t\t\t\t',
'\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tEconomic Analysis & Policy (MGTECON)\n\t\t\t\t\t',
'\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tFinance (FINANCE)\n\t\t\t\t\t',
'\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tGSB General & Interdisciplinary (GSBGEN)\n\t\t\t\t\t',
'\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tHuman Resource Management (HRMGT)\n\t\t\t\t\t',
'\n\t\t\t']
As is evident, the output is littered with extra whitespace (\n and \t). I don't want to iterate over the list again to strip these characters, since the list is huge (truncated here for readability). I tried XPath's normalize-space() to fix this, but it did not work:
>>> response.xpath('normalize-space(//h2[@class="schoolName"]/following-sibling::ul//text())').getall()
['']
What am I doing wrong?