
I have used the Jsoup library to fetch metadata from a URL.

Document doc = Jsoup.connect("http://www.google.com").get();

// first() returns null and get(0) throws when nothing matches, so guard these calls in real code
String keywords = doc.select("meta[name=keywords]").first().attr("content");
System.out.println("Meta keyword : " + keywords);

String description = doc.select("meta[name=description]").get(0).attr("content");
System.out.println("Meta description : " + description);

// Case-insensitive match on image URLs ending in .png, .jpg, .jpeg or .gif
Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
String src = images.get(0).attr("src");
System.out.println("Meta image URL : " + src);

But I want to do it on the client side using JavaScript.

bren
SR230

3 Answers


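You can't do it on the client alone because of the browser's same-origin policy: a script cannot read a response from another origin unless that origin allows it through CORS headers. You need a server-side script to fetch the content of the page.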
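As a rough illustration of what such a server-side script could look like, here is a minimal Node.js sketch. It assumes Node.js 18+ (for the global `fetch`); the `url` query parameter and the names `targetFromRequest` and `createProxy` are made up for this example. It forwards the page's HTML and adds an `Access-Control-Allow-Origin` header so a browser script on another origin can read it. Treat it as a starting point only: an open proxy like this should be restricted before being exposed publicly.

```javascript
const http = require('http');

// Hypothetical helper: pull the target out of "/?url=..." and accept only http(s).
function targetFromRequest(requestUrl) {
  const raw = new URL(requestUrl, 'http://localhost').searchParams.get('url');
  if (!raw) return null;
  try {
    const target = new URL(raw);
    return ['http:', 'https:'].includes(target.protocol) ? target.href : null;
  } catch (err) {
    return null; // not a parseable absolute URL
  }
}

// Hypothetical factory: builds the proxy server without starting it.
function createProxy() {
  return http.createServer(async (req, res) => {
    const target = targetFromRequest(req.url);
    if (!target) {
      res.writeHead(400, { 'Content-Type': 'text/plain' });
      res.end('Missing or invalid "url" parameter');
      return;
    }
    try {
      const upstream = await fetch(target); // global fetch, Node.js 18+
      const html = await upstream.text();
      res.writeHead(200, {
        'Content-Type': 'text/plain; charset=utf-8',
        // Lets a page on another origin read this response
        'Access-Control-Allow-Origin': '*',
      });
      res.end(html);
    } catch (err) {
      res.writeHead(502, { 'Content-Type': 'text/plain' });
      res.end('Could not fetch the target page');
    }
  });
}

// To try it locally (port 3000 is an arbitrary choice): createProxy().listen(3000);
```

The client would then request `http://localhost:3000/?url=` followed by the encoded page address and parse the returned HTML itself.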

Or you can use YQL, which then acts as the proxy: https://policies.yahoo.com/us/en/yahoo/terms/product-atos/yql/index.htm

Or you can use https://cors-anywhere.herokuapp.com, in which case cors-anywhere acts as the proxy.

For example:

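$('button').click(function() {
  $.ajax({
    // Prefix the target URL so the request goes through the CORS proxy
    url: 'https://cors-anywhere.herokuapp.com/' + $('input').val()
  }).then(function(data) {
    // Parse the returned markup; its <meta> tags end up as top-level nodes
    var html = $(data);

    // Use .text() so fetched values are shown as plain text, not injected as markup
    $('#kw').text(getMetaContent(html, 'keywords') || 'no keywords found');
    $('#des').text(getMetaContent(html, 'description') || 'no description found');
    $('#img').text(html.find('img').attr('src') || 'no image found');
  });
});

// Return the content attribute of the first <meta> tag with the given name
function getMetaContent(html, name) {
  return html.filter(
  (index, tag) => tag && tag.name && tag.name == name).attr('content');
}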
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>

<input type="text" placeholder="Type URL here" value="http://www.html5rocks.com/en/tutorials/cors/" />
<button>Get Meta Data</button>

<pre>
  <div>Meta Keyword: <div id="kw"></div></div>
  <div>Description: <div id="des"></div></div>
  <div>image: <div id="img"></div></div>
</pre>
Mosh Feu
  • Thanks for the solution, but how can I show the image from the URL? P.S. The URL contains many images; how do I show the best one from it? – SR230 Mar 10 '16 at 10:32
  • `the best one from it` How do you know which one is the best? – Mosh Feu Mar 17 '16 at 05:38
  • Is this a stable solution for a social network to scrape metadata from URLs, like Facebook does? Can it handle many concurrent requests? – Engineeroholic Dec 13 '16 at 23:46
  • @Engineeroholic I have not tested it with many requests. I'm sure Facebook doesn't do this. The "right" solution is to use a "proxy" server. For more info, read [this](http://stackoverflow.com/a/17299796/863110) answer. – Mosh Feu Dec 14 '16 at 09:10
  • This works here, but I get a 403 host error when I add it to my site. Any suggestions? I copied it word for word. – Jay Aug 08 '19 at 19:37
  • What's the URL you're trying to fetch? – Mosh Feu Aug 08 '19 at 19:43
  • cors-anywhere.herokuapp.com/q1HQHpIgMxHSlnWnkwx5LrYH5WyZJmmUGNQM1tycfmR0qf0mFuChzU5STJY3FT1H:1 Failed to load resource: the server responded with a status of 404 (Invalid host) – Jay Aug 08 '19 at 20:28
  • Nvm - I got it ... the input field was messed up, when manually entering everything worked ! Thanks so much for the post! – Jay Aug 08 '19 at 20:32

Pure JavaScript function

From a Node.js backend (Next.js) I use this:

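export const fetchMetadata = async (url) => {
    const html = await (await fetch(url, {
        // `timeout` is a node-fetch v2 extension; the standard fetch API has no such option
        timeout: 5000,
        headers: {
            'User-Agent': 'request'
        }
    })).text()

    const metadata = {};
    // Match property/name followed by content inside a single <meta> tag, self-closing or not
    const metaTag = /<meta[^>]+?(?:property|name)="([^"]*)"[^>]*?content="([^"]*)"[^>]*>/ig;
    for (const [, key, value] of html.matchAll(metaTag)) {
        metadata[key] = decode(value)
    }
    return metadata
}

// Decode numeric HTML entities such as &#39;
export const decode = (str) => str.replace(/&#(\d+);/g, function(match, dec) {
    return String.fromCharCode(dec);
})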

You could use it on the client with https://cors-anywhere.herokuapp.com/corsdemo
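The regular expression above expects `property`/`name` to appear before `content` and the values to be double-quoted. If the pages you scrape don't follow that order, one option is to find each `<meta>` tag first and then read its attributes independently. The sketch below shows that idea; `parseAttributes` and `extractMeta` are names invented for this example, and it still does not handle unquoted attribute values or a `>` inside a value, so a real HTML parser remains the more robust choice.

```javascript
// Hypothetical helper: read every quoted attribute of one tag into a plain object.
function parseAttributes(tag) {
  const attrs = {};
  const attrPattern = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  for (const [, name, doubleQuoted, singleQuoted] of tag.matchAll(attrPattern)) {
    attrs[name.toLowerCase()] = doubleQuoted !== undefined ? doubleQuoted : singleQuoted;
  }
  return attrs;
}

// Hypothetical helper: map name/property -> content for every <meta> tag in the HTML.
function extractMeta(html) {
  const metadata = {};
  for (const [tag] of html.matchAll(/<meta\b[^>]*>/gi)) {
    const attrs = parseAttributes(tag);
    const key = attrs.property || attrs.name;
    // Skip tags such as <meta charset="..."> that carry no name/content pair
    if (key && attrs.content !== undefined) {
      metadata[key] = attrs.content;
    }
  }
  return metadata;
}
```

The values are returned as they appear in the markup, so they can still be passed through `decode` afterwards.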

math_lab3.ca

You can use open-graph-scraper for this; for more info, see this answer.

fredrivett