
Hello Stackoverflow,

I would like to know: how does one crawl syntax-highlighted code?

This is how content inside a particular tag is crawled:

for sel in response.xpath('//ol/li/h3'):

However, with syntax-highlighted code, such as this highlighted snippet, which outputs

cout << "\n Choose your action:" << endl;

this shows that one would need multiple tags to crawl a specific line. What happens, then, when one has multiple lines of code, e.g. just two lines?

Then comes the question of how one crawls whitespace. According to the code

<li class="li1">
   <div class="de1">
      &nbsp;
   </div>
</li>

Whitespace is represented by &nbsp;, so how can we avoid crawling it as text?

Note: I am coding in Python and using the Scrapy web crawler/spider

Thanks for reading and offering help.

CharlieC

2 Answers


I'm not sure if I am stating the obvious...

What XPath selectors really crawl is the structured text, which requires fairly clean XML to work properly.

Lots of XML is, by nature, not very human-readable; hence the high level of order and the coloring to help our eyes follow the nested levels of tags:

<div></div>

An XPath query does not pay attention to what is in between tags, but rather to the tags themselves (type, attributes, and so on). So if you crawl clean HTML or XML, it doesn't matter how deep or how far away: it will land you on the tag set you are aiming for (then you will likely want to handle the contents yourself).

Well-formed XML is required to have a single root element, so the shortest document you should see is...

<html>
    <div>
            1
    </div>
    <div>
         2
    </div>
    <div>
        <h1>Hello</h1>
    </div>
</html>

So

for sel in response.xpath('//div'):

should iterate over all three, and

for sel in response.xpath('//div//h1'):

would STEP INTO only the very last one and would STEP ON the h1 tag, where you could then read its contents if you wanted.
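To make that concrete, here is a minimal sketch of those two queries run against the sample document above. It uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors (ElementTree's XPath subset wants a leading `.//`, but the idea is the same):

```python
import xml.etree.ElementTree as ET

# The sample document from above, as a string.
doc = """
<html>
    <div>1</div>
    <div>2</div>
    <div><h1>Hello</h1></div>
</html>
"""

root = ET.fromstring(doc)

# './/div' matches every div, however deeply nested.
divs = root.findall('.//div')
print(len(divs))            # 3

# './/div/h1' steps into only the div that contains an h1,
# landing on the h1 tag itself.
h1s = root.findall('.//div/h1')
print(h1s[0].text)          # Hello
```

In Scrapy itself you would write the equivalent as `response.xpath('//div')` and `response.xpath('//div/h1')`.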


Second, HTML and XML actually don't give much credence to whitespace (even though your example looked pretty, that was for your benefit, not the benefit of your code). Python can likewise be told to treat blank lines and single spaces as the same thing (your XPath query should skip whitespace by default).
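For example, one quick way in Python to collapse that cosmetic whitespace (a sketch only; note it also destroys leading indentation, so use it when indentation doesn't matter):

```python
# Text pulled from an HTML node often carries the source file's
# pretty-printing: newlines and indentation that mean nothing.
raw = '\n      cout << "x";\n   '

# str.split() with no argument splits on any run of whitespace,
# so split-then-join collapses it all to single spaces.
clean = ' '.join(raw.split())
print(clean)   # cout << "x";
```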

Edit: As for encoded entities, such as &nbsp;, most HTML packages have an entity-decoding function, as those symbols can cause pain in other areas. You would want to decode the entities into their normal characters, which are often whitespace, left bracket, right bracket, and so on...
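In Python specifically, the standard library's html.unescape does this decoding; a small sketch:

```python
import html

# html.unescape turns entities back into their characters.
# Note &nbsp; becomes U+00A0 (a non-breaking space), not a plain ' '.
decoded = html.unescape('&nbsp;')
print(decoded == '\u00a0')            # True

# If you want ordinary spaces afterwards, swap them explicitly:
print(repr(decoded.replace('\u00a0', ' ')))   # ' '
```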

user2097818
  • Those examples are not mine; I'm just trying to crawl them, specifically from Snipplr.com. The code presented was syntax-highlighted, and as far as I know, it's quite impossible/tedious to crawl highlighted code – CharlieC Jan 09 '15 at 06:05
  • @CharlieC I see what you're saying. It might be possible to parse AS IS, but it will likely not be easy (right-click in the *code* area and view the page source with your browser). You are crawling through lots of span tags with no real sense of *"structure"*, at least not in the scope of XPath..... Do you want to retrieve the code in general, or do you want to retrieve *'pretty'* code? – user2097818 Jan 09 '15 at 06:18
  • @CharlieC I think it is safe to say you are getting into the realm of [Beautiful Soup](http://beautiful-soup-4.readthedocs.org/en/latest/) and a fair bit of custom handling. It really just depends what you want to get out of it. – user2097818 Jan 09 '15 at 06:28
  • I was looking to retrieve pretty code, in reply to your first question. What is the difference, though, between Beautiful Soup and Scrapy? Both seem to use the same "Tag" system. – CharlieC Jan 09 '15 at 06:33
  • @CharlieC Read my second posted answer. BeautifulSoup is for taking disgusting HTML source code and stripping away all the HTML parts until all you are left with are the actual contents. And then it helps you glue those contents back together again. But I don't think you need any of that (not today at least). – user2097818 Jan 09 '15 at 06:35

One last suggestion for you, given your first response earlier.

In those code boxes (at snipplr.com) there is a group of links in the top-right corner that lets you select how you want to view the snippet. You need to crawl to that link, go there, and parse the plain-text version instead.
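A minimal sketch of that idea, with hypothetical markup standing in for snipplr.com's actual view-selector links (the real link text and hrefs will differ):

```python
import xml.etree.ElementTree as ET

# Stand-in for the snippet page's view-selector links; the exact
# markup, link text, and URLs on the real site will differ.
page = """
<div>
  <a href="/view/85476">html</a>
  <a href="/view/85476.txt">plain text</a>
</div>
"""

root = ET.fromstring(page)
# Pick the link whose text says "plain text" and grab its href;
# in Scrapy you would then follow it with a new Request.
plain = [a.attrib['href'] for a in root.findall('.//a')
         if a.text == 'plain text']
print(plain[0])   # /view/85476.txt
```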

Compare these two links...they both point to the same article, but the second is very readable HTML source:

user2097818
  • Yes, I did initially consider going through the plain-text option and later formatting it with a syntax highlighter, but found it to not be consistent across websites. I am pulling from two code-repository websites; plain text is not present on the other website, but both use syntax highlighting, thus the need to crawl syntax-highlighted code instead of plain text. – CharlieC Jan 09 '15 at 06:38
  • @CharlieC You need something to prettify that ugly source code so you can begin to wrap your mind around how you are going to take it apart. Then, once it is apart, you're not going to have functioning code left anymore, so you will need help putting the contents back together. Unless there is a magic tool I have not heard of that does this in a click (there honestly could be), you are going to spend a lot of time getting very skilled with Python text-processing/regex tools. – user2097818 Jan 09 '15 at 06:48
  • Ah, I see. I would think embedding that div class of the syntax highlighter with tables would be the best option; I have not heard of any tools that take source code apart and organise it. – CharlieC Jan 09 '15 at 06:51
  • That might work for you. I thought you wanted to parse the code, but really it sounds more like you just want to copy it or "take a picture" of it so you can print it or something. They do have tools for freezing HTML into PDF, but this is outside of Python, and it usually isn't quite as nice as you wanted it to be, but it's easy. – user2097818 Jan 09 '15 at 06:57
  • I did actually want to parse the code but, as you said, it's not possible. You might get the wrong idea: I'm not parsing website source code but syntax-highlighted source code, **[Actual Link](http://www.snipplr.com/view/85476/futurama-spaceship-game/)** – CharlieC Jan 09 '15 at 07:12
  • `//*[@id="innersource"]/pre/ol/li[@class="li1"]` Use that as your XPath query (I didn't test), but that will only get you down to the actual payload per line. Everything within each of those nodes represents the contents of one line, so even if you see 10 span tags, you strip every tag and every whitespace and cram it all together to get the contents of that line. **The only whitespace to save** is that which is double-quoted **outside span tags**. It seems that text outside the span pairs is stripped of its quotes, but if those quotes retained whitespace, it should be preserved. – user2097818 Jan 09 '15 at 07:57
  • And I'm still not sure if your indentation will come out right – user2097818 Jan 09 '15 at 08:00
  • I highly doubt it; I would think if I did that, each text inside the tag would be on a different line. – CharlieC Jan 09 '15 at 08:05
  • Newlines count as whitespace; you strip those as well. The only thing allowed to start a new line is hitting that node again. – user2097818 Jan 09 '15 at 08:07
  • Would it be possible to use Beautiful Soup and then just extract the text only? Not sure if it would work. http://stackoverflow.com/a/13490061/4294247 – CharlieC Jan 09 '15 at 08:19
  • You're definitely on the right track now. Beautiful Soup is a great tool to learn (because it helps with many other problems), but you could also perform stripping like this using the standard built-in regular-expression modules. Regex is really not fun to learn the first time through, but it starts to grow on you for quick-and-dirty string manipulation. Good luck – user2097818 Jan 09 '15 at 09:36
  • Thanks for your help, user2097818. Yeah, regex is one of the most confusing things to learn in programming, as it is barely close to English or anything legible – CharlieC Jan 09 '15 at 10:06
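Pulling the comment thread together: here is a sketch of the strip-and-join approach described above, using ElementTree's itertext() to reassemble one highlighted line. The markup is a simplified guess at what a highlighter like snipplr.com's emits; real class names and nesting may differ.

```python
import xml.etree.ElementTree as ET

# One highlighted line of code, roughly as a syntax highlighter
# emits it (simplified; the real markup may differ).
li = ET.fromstring(
    '<li class="li1"><div class="de1">'
    '<span class="kw">cout</span> &lt;&lt; '
    '<span class="st">"\\n Choose your action:"</span>'
    '</div></li>'
)

# itertext() walks every text fragment in document order, whether it
# sits inside a span or between spans, so joining the fragments
# reassembles the original line of code with its tags stripped.
line = ''.join(li.itertext())
print(line)   # cout << "\n Choose your action:"
```

One li node per source line means the line boundaries come for free: start a new output line each time you move to the next li.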