Normalize space in Xpath with Python scrapy

Question

I am trying to extract content from the Stanford website using Scrapy and Xpath. The following line gets me what I want:

response.xpath('//h2[@class="schoolName"]/following-sibling::ul//text()').getall()

However, the output of the list is as follows:

[' \n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t', 
 '\n\t\t\t\t\t\tAccounting (ACCT)\n\t\t\t\t\t', 
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t', 
 '\n\t\t\t\t\t\tAction Learning Programs (ALP)\n\t\t\t\t\t', 
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t', 
 '\n\t\t\t\t\t\tEconomic Analysis & Policy (MGTECON)\n\t\t\t\t\t', 
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t', '\n\t\t\t\t\t\tFinance 
 (FINANCE)\n\t\t\t\t\t', '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t', 
 '\n\t\t\t\t\t\tGSB General & Interdisciplinary (GSBGEN)\n\t\t\t\t\t', 
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t', '\n\t\t\t\t\t\tHuman Resource Management 
  (HRMGT)\n\t\t\t\t\t', '\n\t\t\t']

As is evident, the output is littered with extra whitespace (\n and \t). I don't want to iterate over the list again to remove these unwanted characters, since the list is huge (truncated here for readability). I tried using XPath's normalize-space to fix this, but it did not work.

>>>response.xpath('normalize-space(//h2[@class="schoolName"]/following-sibling::ul//text())').getall()
['']

What am I doing wrong?

Does this answer your question? [Is it possible to apply normalize-space to all nodes XPath expression finds?](https://stackoverflow.com/questions/3359512/is-it-possible-to-apply-normalize-space-to-all-nodes-xpath-expression-finds) — Kleber Noel, Apr 04 '21 at 22:52
Also, I looked at the html of the website you are trying to scrape. You could be more precise about the nodes you want to select by adding: `li/a` e.g. `response.xpath('//h2[@class="schoolName"]/following-sibling::ul/li/a')` — Kleber Noel, Apr 04 '21 at 22:53

Kleber Noel · Answer 1 · 2021-04-07T14:20:34.363

Indexing a little deeper into your target node, e.g. ./ul/li/a/text() rather than ./ul//text(), fixes the empty item issue. Note that I visited the webpage you want to scrape and tried some XPath expressions.

Then all you have to do is apply the strip logic JaSON mentioned with something like:

list(map(lambda x: x.strip(), response.xpath('//h2[@class="schoolName"]/following-sibling::ul/li/a/text()').getall()))

Also, whether normalize-space works over many nodes depends on the XPath version that your version of Scrapy uses. In that respect, your post is a duplicate of "Is it possible to apply normalize-space to all nodes XPath expression finds?"

pupspulver · Answer 2 · 2021-04-05T10:43:17.467

You can use split() as an alternative to normalize-space():

texts = [' \n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tAccounting (ACCT)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tAction Learning Programs (ALP)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tEconomic Analysis & Policy (MGTECON)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tFinance FINANCE)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tGSB General & Interdisciplinary (GSBGEN)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tHuman Resource Management (HRMGT)\n\t\t\t\t\t',
 '\n\t\t\t']

# split() with no arguments splits on any run of whitespace
for x in texts:
    print(x.split())

My output:

[]
['Accounting', '(ACCT)']
[]
['Action', 'Learning', 'Programs', '(ALP)']
[]
['Economic', 'Analysis', '&', 'Policy', '(MGTECON)']
[]
['Finance', 'FINANCE)']
[]
['GSB', 'General', '&', 'Interdisciplinary', '(GSBGEN)']
[]
['Human', 'Resource', 'Management', '(HRMGT)']
[]

You can then join the words of each non-empty result back together and store them in a separate list, like this:

Final Code:

...

texts = response.xpath('//h2[@class="schoolName"]/following-sibling::ul//text()').getall()

output = []

for x in texts:
    words = x.split()
    if words:  # skip whitespace-only entries
        output.append(" ".join(words))

print(output)

Output:

['Accounting (ACCT)', 'Action Learning Programs (ALP)', 'Economic Analysis & Policy (MGTECON)', 'Finance FINANCE)', 'GSB General & Interdisciplinary (GSBGEN)', 'Human Resource Management (HRMGT)']

Single line solution: (based on JaSON's idea)

output = [data.strip() for data in response.xpath('//h2[@class="schoolName"]/following-sibling::ul//text()').getall() if data.strip()]

print(output)

Output:

['Accounting (ACCT)', 'Action Learning Programs (ALP)', 'Economic Analysis & Policy (MGTECON)', 'Finance FINANCE)', 'GSB General & Interdisciplinary (GSBGEN)', 'Human Resource Management (HRMGT)']
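
The loop version and the one-liner above are not quite equivalent: `strip()` only trims the ends of each string, whereas `" ".join(x.split())` also collapses whitespace inside it, which is closer to what XPath's `normalize-space()` does. A minimal sketch of the difference; `normalize_space` and the sample strings are illustrative and not part of Scrapy:

```python
def normalize_space(text):
    """Trim the ends and collapse inner whitespace runs to single spaces."""
    return " ".join(text.split())


# Sample values shaped like the question's output; the second has an inner newline.
raw = [
    "\n\t\t\t\t\t",
    "\n\t\t\tFinance \n (FINANCE)\n\t\t",
    "\n\t\t\tAccounting (ACCT)\n\t\t",
]

stripped = [s.strip() for s in raw if s.strip()]
normalized = [normalize_space(s) for s in raw if s.strip()]

print(stripped)    # the inner "\n" survives in the first item
print(normalized)  # every item is a single clean line
```

If the link texts never contain line breaks, both approaches give the same result and `strip()` is enough.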

This splits items on whitespace, which gives me the wrong result. "Action Learning Programs" should be one value instead of three. — Amistad, Apr 04 '21 at 22:03

JaSON · Answer 3 · 2021-04-05T14:53:31.440

You need to use the strip() method to get rid of the tab and newline characters:

[text for text in [text.strip() for text in response.xpath('//h2[@class="schoolName"]/following-sibling::ul//text()').getall()] if text]
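
If the concern is walking a huge list more than once, the same filtering can be written as a single lazy pass. A sketch, assuming Python 3.8+ for the `:=` operator; `iter_clean` is a hypothetical helper name, and the sample list stands in for the `.getall()` result:

```python
def iter_clean(texts):
    """Yield each non-blank string once, stripped, without a temporary list."""
    for raw in texts:
        if (text := raw.strip()):
            yield text


# Stand-in for response.xpath(...).getall()
sample = ["\n\t\t", "\n\t\tAccounting (ACCT)\n\t", "\n\t", "\n\tFinance (FINANCE)\n"]

cleaned = list(iter_clean(sample))
print(cleaned)
```

Because `iter_clean` is a generator, the cleaned values can also be consumed one at a time instead of being collected with `list()`.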

edited Apr 05 '21 at 14:53

answered Apr 05 '21 at 07:28
