I have a MongoDB collection of documents with an HTML body, and I want to extract the text of the html > head > title tag from each document.
I'm currently doing this in Python, and it's causing too much network traffic.
I have read that MongoDB can execute server-side JavaScript, so I tried running mongo test.js with the following code:

var db = connect('mongodb://127.0.0.1:27017/analytics');
var body = db.https.find({},{_id:0,"data.body": 1}).limit(40);
while (body.hasNext()) {
    var resp =  body.next()
    var htmlBody = resp.data.body  // the projection returns { data: { body: ... } }
    var el = document.createElement( 'html' );
    el.innerHTML = htmlBody
    var title = el.getElementsByTagName( 'title' );
    print(title)
}

It fails with the following error:
E QUERY [js] uncaught exception: ReferenceError: document is not defined :
Is this possible server-side? If so, how can I do it?

Doe

1 Answer

MongoDB's server-side JavaScript does not include an HTML parsing or DOM library, which is why document is not defined.

The data.body field probably contains a string. If you control data ingestion, you will probably find it simpler to parse the HTML on the way in and store the title in a separate field.

The only option I can think of server-side is a regex, but matching HTML with a regex is usually not an easy thing to get right.
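As a rough illustration of the regex route, here is a minimal sketch. The helper name extractTitle is hypothetical (it is not part of MongoDB), and the pattern assumes the first <title>...</title> pair in the string is the page title. It does not decode HTML entities, and it can be fooled by a <title> inside a comment, script, or inline SVG.

```javascript
// Hypothetical helper (not a MongoDB API): return the text of the first
// <title> element in an HTML string, or null if none is found.
function extractTitle(html) {
    if (typeof html !== "string") {
        return null;
    }
    // [^>]* tolerates attributes on the tag; [\s\S]*? matches lazily,
    // including across newlines; the i flag ignores tag case.
    var match = /<title\b[^>]*>([\s\S]*?)<\/title\s*>/i.exec(html);
    if (match === null) {
        return null;
    }
    // Collapse runs of whitespace and trim the ends.
    return match[1].replace(/\s+/g, " ").trim();
}

// Possible use in the mongo shell, with the collection and field names
// taken from the question:
//
// db.https.find({}, {_id: 0, "data.body": 1}).limit(40).forEach(function (doc) {
//     print(extractTitle(doc.data && doc.data.body));
// });
```

Note that a script run with the mongo shell still pulls each body into the shell process, so this only reduces traffic if the shell runs close to the server. On MongoDB 4.2 or later, the $regexFind aggregation operator may be worth a look, since it can apply a similar pattern inside the server and return only the match.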

Joe