
I have several websites, and I can't remember where I wrote some lines of code. Since my pages are indexed by Google, I would like to know whether Google offers a way to search within the HTML source code/mark-up itself, rather than only within the visible, rendered part of a page.

Thanks

Big Rich
Entretoize

2 Answers


I've come across the following resources on my travels (some already mentioned above):

HTML Mark-up-focused search engines

I'd also like to throw in the following:

Huge, website crawl data archives

How can we analyze this crawl data?

For an idea of how to begin analyzing some of this massive data, take a look at Big Data/map-reduce-type framework(s).

A Google search turns up some ideas on using Apache's Spark project to analyze Common Crawl's dump(s). To understand the file format(s) used by Common Crawl, refer to the following:

The article, Accessing-Common-Crawl-Dataset-on-S3, outlines how to access Common Crawl's 250TB+ dump(s) at low cost, without transferring that data outside of Amazon's AWS/S3 network. Of course, that assumes you are going to use some combination of AWS/EC2/S3, etc. to analyze the crawl data.
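To make the map-reduce idea concrete, here is a minimal, self-contained Python sketch that counts, per domain, how many pages contain a given mark-up snippet. The sample records and snippet are stand-ins: in a real job you would stream (url, raw-HTML) pairs out of Common Crawl's WARC files (e.g. with a WARC-reading library) and run the map and reduce steps on Spark or Hadoop rather than in-process.

```python
from collections import Counter

# Toy stand-ins for crawl records: (url, raw_html) pairs.
# In a real job these would be streamed from WARC files on S3.
records = [
    ("http://a.example/page1", "<div class='my-widget'>hello</div>"),
    ("http://b.example/index", "<p>no widget here</p>"),
    ("http://a.example/page2", "<div class='my-widget'>again</div>"),
]

SNIPPET = "class='my-widget'"  # the mark-up you are hunting for

def map_record(record):
    """Map step: emit (domain, 1) if the raw mark-up contains the snippet."""
    url, html = record
    domain = url.split("/")[2]
    return [(domain, 1)] if SNIPPET in html else []

def reduce_counts(pairs):
    """Reduce step: sum the per-domain hit counts."""
    counts = Counter()
    for domain, n in pairs:
        counts[domain] += n
    return dict(counts)

mapped = [pair for record in records for pair in map_record(record)]
hits = reduce_counts(mapped)
print(hits)  # {'a.example': 2}
```

The same map/reduce pair translates almost directly to Spark's `flatMap` and `reduceByKey`, which is what makes the pattern attractive at Common Crawl scale.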

Finally, Patrick Durusau maintains some interesting Common-Crawl-usage-related blog pages.

Personally, I find this subject intriguing; I suggest we get this crawl data while it's HOT! ;-)

Dave Powers
Big Rich
  • In my case, the site engine is leaking private URLs from a particular domain *(I am sure it doesn't come from users)*. How can I search in the source of a single domain? *(in order to find where the leak comes from)* – user2284570 Sep 17 '15 at 00:06
  • Assuming you have access to a Unix-like Bash console (try 'Git Bash', unxutils or cygwin on Windows), you could use a number of solutions based on various combinations of wget/curl/xidel/grep/awk for example. [This SO post](http://stackoverflow.com/questions/2804467/spider-a-website-and-return-urls-only) contains various solutions, [this is the Google search I used](https://www.google.com/search?q=extract+urls+(curl+OR+wget)). – Big Rich Sep 17 '15 at 09:10
  • Basically, you'll want to loop over important URL's within your domain to find/store which pages are 'leaking'. – Big Rich Sep 17 '15 at 09:17
  • The site is several petabytes large, with billions of URLs. Nearly all of the pages are dynamic. Do you have a better solution than crawling it myself? – user2284570 Sep 17 '15 at 15:16
  • Sounds like you *may* need to run your crawls in a high-concurrency environment. A clustered actor pattern, such as Scala/Java's [Akka](http://akka.io), should do it, or have a look at a similarly clustered map-reduce pattern (farming out the URL collection/identification work to sub-units, on [Spark](http://spark.apache.org) or [Hadoop](https://hadoop.apache.org)). I'm including [some related resource URLs in a pastebin](http://pastebin.com/hYfpefv0). It would be interesting to find out which direction you go in, please let us know. – Big Rich Sep 17 '15 at 15:52
  • Ok, in fact I have nothing to do with the company, but they aren't replying to the e-mail I sent to their address dedicated to reporting security problems (I was very tired and made mistakes; I'm afraid they are fed up with me). But this should be discussed privately. – user2284570 Sep 17 '15 at 20:54
  • Is there any tool where I can put a source site (to be scanned) plus a target site link to search for in its code? – Divya Sep 02 '17 at 09:05
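The wget/curl/grep approach suggested in the comments can be sketched as a simple loop: fetch each important page's raw source and report which ones contain the leaked URL. This is a minimal Python version; the page list and leaked path are placeholder assumptions, and the fetcher is stubbed with canned sources so the logic stands on its own (in practice you would swap in `urllib.request.urlopen` or shell out to curl).

```python
LEAKED = "/private/report-2015.pdf"  # placeholder for the leaking private URL

def page_leaks(html, needle=LEAKED):
    """Return True if the raw page source contains the private URL."""
    return needle in html

def scan(urls, fetch):
    """Loop over important URLs and collect the pages that leak."""
    return [u for u in urls if page_leaks(fetch(u))]

# Stubbed fetcher with canned page sources; replace with a real one, e.g.
#   fetch = lambda u: urllib.request.urlopen(u).read().decode("utf-8", "replace")
SOURCES = {
    "http://example.com/a": "<a href='/private/report-2015.pdf'>x</a>",
    "http://example.com/b": "<p>clean page</p>",
}
leaking = scan(SOURCES, SOURCES.get)
print(leaking)  # ['http://example.com/a']
```

For a site with billions of dynamic pages this single-threaded loop won't scale, which is exactly why the comments above point toward Akka- or Spark-style parallelism: the `scan` step shards naturally across workers.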

You can try PublicWWW to search in source/mark-up. It lets you find any HTML, JavaScript, CSS, or plain text in web page source code across 167+ million websites.

With PublicWWW you can:

  • Find related websites through the unique HTML codes they share, i.e. widgets & publisher IDs.

  • Identify sites using certain images or badges.

  • Find out who else is using your theme.
  • Identify sites mentioning you.
  • Find your competitor's affiliates.
  • Identify sites where your competitors personally collaborate or interact.
  • Find references to the use of a library or a platform.
  • Find code examples on the net.
  • Figure out who is using what JS widgets on their sites.
  • ...

Of course, you can find not only your own websites that use a given code/mark-up snippet.

James Andreenko
  • Worth noting that only the websites in the top 1 million are revealed for free. Results from the top 3 million are revealed after registering. The rest are paid. Also, the revealed results only show the domain and not the full URL. – glebm Nov 20 '16 at 22:56
  • Is this page broken? I don't care if I have to pay to get the information, but when I try to purchase I get "Plan not available" on all items. Does somebody know what happened to the page? It has been like that for about 4 months. – John Balvin Arias Nov 06 '21 at 15:25