
I have several websites, and I can't remember where I wrote some lines of code. Since my pages are indexed by Google, I would like to know whether Google offers a way to search within the HTML source code/mark-up itself, rather than only within the visible, rendered part of a page.

Thanks

Big Rich
Entretoize

2 Answers


I've come across the following resources on my travels (some already mentioned above):

HTML Mark-up-focused search engines

I'd also like to throw in the following:

Huge, website crawl data archives

How can we analyze this crawl data?

For an idea of how to begin analyzing some of this massive data, take a look at Big Data/map-reduce-type framework(s).

A Google search turns up some ideas on using Apache's Spark project to analyze Common Crawl's dump(s). To understand the file format(s) used by Common Crawl, refer to the following:

The article, Accessing-Common-Crawl-Dataset-on-S3, outlines how to access Common Crawl's 250TB+ dump(s) at low cost, without transferring that data outside of Amazon's AWS/S3 network. Of course, that assumes you are going to use some combination of AWS/EC2/S3, etc. to analyze the crawl data.
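To make the map-reduce idea concrete, here is a minimal, self-contained Python sketch that counts, per domain, how many pages contain a given mark-up snippet. The sample records and snippet are stand-ins: in a real job you would stream (url, raw-HTML) pairs out of Common Crawl's WARC files (e.g. with a WARC-reading library) and run the map and reduce steps on Spark or Hadoop rather than in-process.

```python
from collections import Counter

# Toy stand-ins for crawl records: (url, raw_html) pairs.
# In a real job these would be streamed from WARC files on S3.
records = [
    ("http://a.example/page1", "<div class='my-widget'>hello</div>"),
    ("http://b.example/index", "<p>no widget here</p>"),
    ("http://a.example/page2", "<div class='my-widget'>again</div>"),
]

SNIPPET = "class='my-widget'"  # the mark-up you are hunting for

def map_record(record):
    """Map step: emit (domain, 1) if the raw mark-up contains the snippet."""
    url, html = record
    domain = url.split("/")[2]
    return [(domain, 1)] if SNIPPET in html else []

def reduce_counts(pairs):
    """Reduce step: sum the per-domain hit counts."""
    counts = Counter()
    for domain, n in pairs:
        counts[domain] += n
    return dict(counts)

mapped = [pair for record in records for pair in map_record(record)]
hits = reduce_counts(mapped)
print(hits)  # {'a.example': 2}
```

The same map/reduce pair translates almost directly to Spark's `flatMap` and `reduceByKey`, which is what makes the pattern attractive at Common Crawl scale.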

Finally, Patrick Durusau maintains some interesting Common-Crawl-usage-related blog pages.

Personally, I find this subject intriguing; I suggest we get this crawl data while it's HOT! ;-)

Dave Powers
Big Rich
  • In my case, the site engine is leaking private URLs from a particular domain *(I am sure it doesn't come from users)*. How can I search in the source of a single domain? *(in order to find where the leak comes from)* – user2284570 Sep 17 '15 at 00:06
  • Assuming you have access to a Unix-like Bash console (try 'Git Bash', unxutils or cygwin on Windows), you could use a number of solutions based on various combinations of wget/curl/xidel/grep/awk for example. [This SO post](http://stackoverflow.com/questions/2804467/spider-a-website-and-return-urls-only) contains various solutions, [this is the Google search I used](https://www.google.com/search?q=extract+urls+(curl+OR+wget)). – Big Rich Sep 17 '15 at 09:10
  • Basically, you'll want to loop over important URL's within your domain to find/store which pages are 'leaking'. – Big Rich Sep 17 '15 at 09:17
  • The site is several petabytes large, with billions of URLs. Nearly all of the pages are dynamic. Do you have a better solution than crawling it myself? – user2284570 Sep 17 '15 at 15:16
  • Sounds like you *may* need to run your crawls in a high-concurrency environment. A clustered actor pattern, such as Scala/Java's [Akka](http://akka.io), should do it, or have a look at a similarly clustered map-reduce pattern (farming out the URL collection/identification work to sub-units, on [Spark](http://spark.apache.org) or [Hadoop](https://hadoop.apache.org)). I'm including [some related resource URLs in a pastebin](http://pastebin.com/hYfpefv0). It would be interesting to find out which direction you go in, please let us know. – Big Rich Sep 17 '15 at 15:52
  • Ok, in fact I have nothing to do with the company, but they aren't replying to the e-mail I sent to their address dedicated to reporting security problems (I was very tired and made mistakes; I'm afraid they are fed up with me). But this should be discussed privately. – user2284570 Sep 17 '15 at 20:54
  • Is there any tool where I can put a source site (to be scanned) plus a target site link to search for in its code? – Divya Sep 02 '17 at 09:05
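The wget/curl/grep approach suggested in the comments can be sketched as a simple loop: fetch each important page's raw source and report which ones contain the leaked URL. This is a minimal Python version; the page list and leaked path are placeholder assumptions, and the fetcher is stubbed with canned sources so the logic stands on its own (in practice you would swap in `urllib.request.urlopen` or shell out to curl).

```python
LEAKED = "/private/report-2015.pdf"  # placeholder for the leaking private URL

def page_leaks(html, needle=LEAKED):
    """Return True if the raw page source contains the private URL."""
    return needle in html

def scan(urls, fetch):
    """Loop over important URLs and collect the pages that leak."""
    return [u for u in urls if page_leaks(fetch(u))]

# Stubbed fetcher with canned page sources; replace with a real one, e.g.
#   fetch = lambda u: urllib.request.urlopen(u).read().decode("utf-8", "replace")
SOURCES = {
    "http://example.com/a": "<a href='/private/report-2015.pdf'>x</a>",
    "http://example.com/b": "<p>clean page</p>",
}
leaking = scan(SOURCES, SOURCES.get)
print(leaking)  # ['http://example.com/a']
```

For a site with billions of dynamic pages this single-threaded loop won't scale, which is exactly why the comments above point toward Akka- or Spark-style parallelism: the `scan` step shards naturally across workers.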

You can try PublicWWW to search in source/mark-up. It lets you find any HTML, JavaScript, CSS, or plain text in web page source code across 167+ million websites.

With PublicWWW you can:

  • Find related websites through the unique HTML codes they share, i.e. widgets & publisher IDs.

  • Identify sites using certain images or badges.

  • Find out who else is using your theme.
  • Identify sites mentioning you.
  • Find your competitor's affiliates.
  • Identify sites where your competitors personally collaborate or interact.
  • Find references to the use of a library or a platform.
  • Find code examples on the net.
  • Figure out who is using what JS widgets on their sites.
  • ...

Of course, you can find not only your own websites that use a given code/mark-up snippet.

James Andreenko
  • Worth noting that only the websites in the top 1 million are revealed for free. Results from the top 3 million are revealed after registering. The rest are paid. Also, the revealed results only show the domain and not the full URL. – glebm Nov 20 '16 at 22:56
  • Is this page broken? I don't care if I have to pay to get the information, but when I try to purchase I get "Plan not available" on all items. Does somebody know what happened to the page? It has been like that for about 4 months. – John Balvin Arias Nov 06 '21 at 15:25