scrapy shell http://www.zvon.org/comp/r/tut-XPath_1.html
response.css("div.description")
response.xpath('//div[@class="description"]')

I am new to Scrapy. While writing my own spider, I tried to crawl the text from http://www.zvon.org/comp/r/tut-XPath_1.html, including the description text and the right-bar text, so that I can build the next-page URL. I have spent 5 hours on this but failed to write a working CSS selector or XPath, for example for

<div class="right_menu_body_item">List of XPaths</div>

and

<div class="description">XPath is described in <a href="http://www.w3.org/TR/xpath" target="_blank" id="cglh" title="XPath 1.0 standard">XPath 1.0 standard</a>. 

Can anyone help? Thanks!

<script type="text/javascript"> 
 
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-15189975-1']);
  _gaq.push(['_trackPageview']);
 
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(ga);
  })();
 
</script>   

  <div id="page">
    <div id="top"><h1 class="top">XPath 1.0 Tutorial</h1>
    </div>
    <div id="right" style="width: 230px;"><div style="width:234px; margin-top:10px; height:60px;background:url(http://www.highposition.net/embedded/img/234x60-hpbg.png);color#fff" id="hpban">                
                <div style="padding:9px;padding-left:78px;font-family:arial;color:#fff;font-size:11px;">For a flurry of SEO tips, tricks, articles and advice - visit <a href="http://www.hpgroup-seo.co.uk" rel="nofollow">HP Group</a>.</div></div>

<div id="right_menu_header">
<div class="right_menu_header_item right_menu_header_item_selected">
Pages
<span id="header_count_Pages" style="font-style:italic; font-weight:normal">(23)</span></div>
<div class="right_menu_header_item">
Keywords
<span id="header_count_Keywords" style="font-style:italic; font-weight:normal">(34)</span></div>
<div id="filter_div">filter: <input name="right_menu_filter" id="right_menu_filter"></div>
<div class="filter_div_comment"><input name="regexpEnabled" id="regexpEnabled" type="checkbox">enable regexp (<a href="/comp/r/zvon.html#Help~Filter">?</a>)</div>
</div>
<div id="right_menu_body"><div class="pn_right_menu_body_ttt"><span class="right_menu_body_first_passive">First</span> - <span class="right_menu_body_prev_passive">Prev</span> - <span class="right_menu_body_next">Next</span></div>
<div id="right_menu_body_head">
1
-
20
<span style="color:red; font-weight:bold">filter: off</span> (23)
</div>
**<div class="right_menu_body_item">List of XPaths</div>**
<div class="right_menu_body_item">XPath as filesystem addressing</div>
<div class="right_menu_body_item">Start with //</div>
<div class="right_menu_body_item">All elements: *</div>
<div class="right_menu_body_item">Further conditions inside []</div>
<div class="right_menu_body_item">Attributes</div>
<div class="right_menu_body_item">Attribute values</div>
<div class="right_menu_body_item">Nodes counting</div>
<div class="right_menu_body_item">Playing with names of selected elements</div>
<div class="right_menu_body_item">Length of string</div>
<div class="right_menu_body_item">Combining XPaths with |</div>
<div class="right_menu_body_item">Child axis</div>
<div class="right_menu_body_item">Descendant axis</div>
<div class="right_menu_body_item">Parent axis</div>
<div class="right_menu_body_item">Ancestor axis</div>
<div class="right_menu_body_item">Following-sibling axis</div>
<div class="right_menu_body_item">Preceding-sibling axis</div>
<div class="right_menu_body_item">Following axis</div>
<div class="right_menu_body_item">Preceding axis</div>
<div class="right_menu_body_item">Descendant-or-self axis</div>
<div class="pn_right_menu_body_bbb"><span class="right_menu_body_first_passive">First</span> - <span class="right_menu_body_prev_passive">Prev</span> - <span class="right_menu_body_next">Next</span></div></div></div>
    <div id="left">
      <div id="search_div"><div><input id="search_input" name="search_input" value="...loading..."> <a href="http://fusion.google.com/add?source=atgs&amp;moduleurl=http%3A//zvon.org/gadgets/zvon_keywords.xml"><img id="plus_google" src="http://gmodules.com/ig/images/plus_google.gif" style="margin:2px" alt="Add to Google" border="0"></a><div id="search_input_text"></div></div><div id="result_div"></div></div>

      <div id="hint_div">
 ⇒ interactive index to zvon materials
      </div>
      
      <div id="category_logo_div">
 <table id="category-table">
   <tbody><tr>
     <td id="category-switch">
       <img src="/shared/png/comp.png" height="66" width="70">
     </td>
     <td id="category-switch-links">
       <div class="category-div">
  <a href="/" id="switch-comp" class="switch-selected">
    comp
    <img src="/shared/png/comp_small.png" title="computing resources" style="display: none;" height="15" width="16">
  </a>
       </div>
       <div class="category-div">
  <a href="/law" id="switch-law">
    law
    <img src="/shared/png/law_small.png" title="international law documents" height="15" width="16">
  </a>
       </div>
       <div class="category-div">
  <a href="/lib" id="switch-lib">
    lib
    <img src="/shared/png/lib_small.png" title="resources for librarians" height="15" width="16">
  </a>
       </div>
       <div class="category-div">
  <a href="/eco" id="switch-eco">
    eco
    <img src="/shared/png/eco_small.png" title="eco resources" height="15" width="16">
  </a>
       </div>
     </td>
   </tr>
 </tbody></table>
      </div>

    <div id="center" style="width: 500px;">
      <div id="noscript" style="display: none;"><div id="noscript_intro">XPath is described in <a href="http://www.w3.org/TR/xpath" target="_blank" id="cglh" title="XPath 1.0 standard">XPath 1.0 standard</a>. In this tutorial selected XPath features are demonstrated on many examples.<br> <br> <div> <b>Standard excerpt:</b> </div> <blockquote class="webkit-indent-blockquote" style="BORDER:none;MARGIN:0 0 0 40px"> <div> XPath is the result of an effort to provide a common syntax and semantics for functionality shared between XSL Transformations and XPointer. The primary purpose of XPath is to address parts of an XML document. In support of this primary purpose, it also provides basic facilities for manipulation of strings, numbers and booleans. XPath uses a compact, non-XML syntax to facilitate use of XPath within URIs and XML attribute values. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document. </div> </blockquote> <br> Zvon offers other <a href="/comp/m/xpath.html" target="_blank" title="XPath related materials">XPath related materials</a>.<br> <br> <b><br> </b> <div> <b>Prepared by:</b> Miloslav Nic (Mila)<span id="nicmila_details"></span> </div> <br></div></div>
      <div id="center_top"></div>
      <div id="center_middle"><h1 id="browser_title_line">XPath 1.0 Tutorial</h1><div id="prevNextDiv"><span id="backPageSpanPassive">Back</span>|<span id="forwardPageSpanPassive">Forward</span>||<span id="prevPageSpanPassive">Previous</span>|<span id="nextPageSpan">Next</span></div>**<div class="description">XPath is described in <a href="http://www.w3.org/TR/xpath" target="_blank" id="cglh" title="XPath 1.0 standard">XPath 1.0 standard</a>. In this tutorial selected XPath features are demonstrated on many examples.<br> <br> <div> <b>Standard excerpt:</b> </div> <blockquote class="webkit-indent-blockquote" style="BORDER:none;MARGIN:0 0 0 40px"> <div></div> </blockquote> <br> Zvon offers other <a href="/comp/m/xpath.html" target="_blank" title="XPath related materials">XPath related materials</a> XPath is the result of an effort to provide a common syntax and semantics for functionality shared between XSL Transformations and XPointer. The primary purpose of XPath is to address parts of an XML document. In support of this primary purpose, it also provides basic facilities for manipulation of strings, numbers and booleans. XPath uses a compact, non-XML syntax to facilitate use of XPath within URIs and XML attribute values. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document. .<br> <br> <b><br> </b>** <div> <b>Prepared by:</b> Miloslav Nic (Mila)<span id="nicmila_details"></span> </div> <br></div><div id="prevNextDivBottom"><span id="prevPageSpanPassive">Previous</span>|<span id="nextPageSpan">Next</span></div></div>
      <div id="center_bottom"><h2 class="bottom">XPath 1.0 Tutorial</h2><div id="front_keywords"><i>keywords</i>: <a href="/comp/m/programming.html">programming</a>, <a href="/comp/m/tutorial.html">tutorial</a>, <a href="/comp/m/xml.html">XML</a>, <a href="/comp/m/xpath.html">XPath</a></div> </div>
    </div>
    <div id="bottom"></div>

      <div id="example_div">
 <div id="example_menu_div" class="windowMenu">
   <span id="close_example_span" class="windowMenuButton">x</span>
   <span id="example_title_text" class="windowMenuText"></span>
 </div>
 <div id="example_body_div"></div>
      </div>
  </div>
  <script type="text/javascript" src="http://www.google.com/jsapi"></script>
  <script type="text/javascript">google.load("jquery", "1");</script><script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js" type="text/javascript"></script>
  <!--script src="/Javascript/jquery.min.js"></script-->
  <script type="text/javascript" src="/Javascript/zvon.js"></script>
  <script type="text/javascript">release="20100406"</script>
  <script type="text/javascript">indexes={"_gadget": false, "_examples": [], "_display_format": {"Keywords": {"tp": "keyword", "title": ["name"]}, "Pages": {"tp": "page", "title": ["name"]}}, "_indexes": [["Pages", "page"], ["Keywords", "keyword"]], "Pages": ["List of XPaths", "XPath as filesystem addressing", "Start with //", "All elements: *", "Further conditions inside []", "Attributes", "Attribute values", "Nodes counting", "Playing with names of selected elements", "Length of string", "Combining XPaths with |", "Child axis", "Descendant axis", "Parent axis", "Ancestor axis", "Following-sibling axis", "Preceding-sibling axis", "Following axis", "Preceding axis", "Descendant-or-self axis", "Ancestor-or-self axis", "Orthogonal axes", "Numeric operations"], "_matID": "tut-XPath_1", "Keywords": ["", "&gt;", "&lt;", "*", "/", "//", "=", "@", "[]", "absolute path", "ancestor", "attribute", "axis", "ceiling", "child", "contains", "count", "descendant", "div", "division", "floor", "following", "last", "name", "normalize-space", "not", "parent", "preceding", "self", "sibling", "starts-with", "string", "string-length", "|"], "_title": "XPath 1.0 Tutorial"}</script>
  
<!-- script src="/Javascript/zvon_browser.js"></script>
<script src="/Javascript/zvon_xmlbrowser.js"></script -->
<!--script type="text/javascript">
  $.get('http://c.zvon.org/counter/'+encodeURIComponent(window.location));
</script-->



  <div id="dynamic_div" style="top: 100.133px; left: 259.5px;">
    <div id="dynamic_menu_div" class="windowMenu">
      <span id="close_dynamic_span" class="windowMenuButton">x</span>
      <span id="dynamic_title_text" class="windowMenuText"></span>
    </div>
    <div id="inpDiv">
      <div id="dynamic_pictogram" style="background-image: url(&quot;/shared/png/comp_small.png&quot;); background-repeat: no-repeat;"></div>
      <span id="inpStarts">
 <input name="inp" value="start" checked="checked" type="radio"> 
 starts with
      </span> 
      <span id="inpContains" class="disabled">
 <input name="inp" value="contains" type="radio">
 contains
 <span id="inpContains3chars"> (at least 3 characters needed)</span>
      </span>
      
    </div>
    <div id="dynamic_body_div">
    </div>
  </div>
  
  <img id="key" src="/shared/png/key.png" style="top: 100.133px; left: 119.5px;">

  
<!-- div id='adsense_right_top'>
<script type="text/javascript"><!- -
google_ad_client = "pub-8853328679404934";
/* refrences_top */
google_ad_slot = "9999918284";
google_ad_width = 234;
google_ad_height = 60;
//-->
<!--/script>
<script type="text/javascript"
src="http://pagead2.googlesyndication.com/pagead/show_ads.js">
</script>
</div -->




 
VoidBug
  • `div.description` is the correct CSS selector to get `div` element with `class` 'description'. What's the problem? What did you mean by 'failed'? (exception, no element returned, or something else? please explain). – har07 Mar 18 '17 at 08:55
  • Yes, the problem was that no element was returned, but I now know that is because some of the markup is generated dynamically by JavaScript. I don't know how to crawl this kind of website, but I no longer think there is anything wrong with my CSS or XPath. LOL. So, do you know how to crawl dynamic websites? Could you please recommend some resources to learn from? – VoidBug Mar 18 '17 at 12:55

1 Answer

You should disable JavaScript in your browser, since Scrapy doesn't render JavaScript, and then inspect the source:

scrapy shell http://www.zvon.org/comp/r/tut-XPath_1.html
# disable JavaScript in your browser, then open the downloaded response in it:
view(response)
# now inspect the body for your fields,
# e.g. `response.css("div.description")` becomes:
response.css('div#noscript_intro')
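
To see why the selector has to change, the check can be reproduced outside the Scrapy shell. The following is only a minimal sketch using Python's standard library, not Scrapy's own selector engine. `RAW_HTML` is an abbreviated, assumed copy of the markup served before any JavaScript runs, and `DivTextCollector` and `div_text` are hypothetical helper names introduced for illustration.

```python
from html.parser import HTMLParser

# Abbreviated, assumed copy of the markup Scrapy receives (no JavaScript has run).
RAW_HTML = """
<div id="center">
  <div id="noscript"><div id="noscript_intro">XPath is described in
  <a href="http://www.w3.org/TR/xpath">XPath 1.0 standard</a>.
  In this tutorial selected XPath features are demonstrated on many examples.</div></div>
  <div id="center_top"></div>
  <div id="center_middle"></div>
</div>
"""


class DivTextCollector(HTMLParser):
    """Collect the text inside the first <div> whose attribute equals a value."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr, self.value = attr, value
        self.depth = 0       # > 0 while the parser is inside the target div
        self.found = False   # True once the target div has been seen
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside the target
        elif not self.found and dict(attrs).get(self.attr) == self.value:
            # Exact attribute match only; multi-class values are not split here.
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, attr, value):
    """Return the whitespace-normalised text of the matching div, or None."""
    parser = DivTextCollector(attr, value)
    parser.feed(html)
    parser.close()
    if not parser.found:
        return None
    return " ".join("".join(parser.chunks).split())


print(div_text(RAW_HTML, "class", "description"))   # the JavaScript-built div is absent
print(div_text(RAW_HTML, "id", "noscript_intro"))   # the static fallback is present
```

Inside the Scrapy shell the equivalent lookup would probably be `response.css('div#noscript_intro ::text').getall()` (or `.extract()` on older Scrapy versions); treat that as a suggestion to try rather than verified output for this page.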
Granitosaurus
  • Thanks, that's really helpful. If I disable JavaScript in my browser, some of the markup disappears, and now I know why. I am new to Scrapy and weak in JavaScript, so I will learn JavaScript first. Do you know how to crawl dynamic websites? Could you please recommend some resources to learn from? – VoidBug Mar 18 '17 at 13:01
  • You don't really need to know much JavaScript for this; the majority of websites have most of their content present even with JavaScript disabled. The big exception is dynamically generated websites that use AJAX requests to load _more_ content; see this question for more info: http://stackoverflow.com/questions/8550114 . However, AJAX is mostly used for paginating results rather than for the page's main content, so in your case most of the content you are looking for should be accessible without JavaScript. – Granitosaurus Mar 18 '17 at 13:04
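
For the right-bar items from the question (for example "List of XPaths"), the page dump suggests the titles are also present in the static HTML, inside the inline `indexes={...}` script, so they may be recoverable without rendering JavaScript. The sketch below assumes that script survives in the raw response. `RAW_HTML` is a shortened copy of that script, and `extract_indexes` is a hypothetical helper name, not part of Scrapy.

```python
import json
import re

# Shortened copy of the inline script from the page dump (assumed to be
# present in the raw, non-rendered response as well).
RAW_HTML = (
    '<script type="text/javascript">release="20100406"</script>\n'
    '<script type="text/javascript">indexes={"_gadget": false, '
    '"Pages": ["List of XPaths", "XPath as filesystem addressing", "Start with //"], '
    '"_matID": "tut-XPath_1", "_title": "XPath 1.0 Tutorial"}</script>'
)


def extract_indexes(html):
    """Return the object assigned to `indexes` in an inline script, or None."""
    # Capture from the first "{" after "indexes=" up to the "}" that is
    # immediately followed by the closing </script> tag.
    match = re.search(r"indexes\s*=\s*(\{.*?\})\s*</script>", html, re.DOTALL)
    if match is None:
        return None
    # The literal in the dump is valid JSON (double quotes, lowercase false).
    return json.loads(match.group(1))


indexes = extract_indexes(RAW_HTML)
page_titles = indexes["Pages"]
print(page_titles)
```

In a spider, the same regular expression could likely be applied to `response.text`. How the titles map to next-page URLs is not shown in the dump, so that step is left open here.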