How to extract title from html body saved in mongodb, using server side javascript execution?

Question

I have a MongoDB collection whose documents contain an HTML body, and I want to extract the text of the html > head > title tag from each document.
I'm currently doing this in Python, and it's causing too much network traffic.
I have read that MongoDB can execute server-side JavaScript, so I tried running mongo test.js with the following code:

var db = connect('mongodb://127.0.0.1:27017/analytics');
var body = db.https.find({},{_id:0,"data.body": 1}).limit(40);
while (body.hasNext()) {
    var resp =  body.next()
    var htmlBody=resp.body
    var el = document.createElement( 'html' );
    el.innerHTML = htmlBody
    var title = el.getElementsByTagName( 'title' );
    print(title)
}

It fails with this error:

E QUERY [js] uncaught exception: ReferenceError: document is not defined :

Is this possible server-side? If so, how can I do it?

Joe · Accepted Answer · 2020-05-24T22:01:00.253

MongoDB's server-side JavaScript does not include an HTML parsing or DOM library.

The data.body field probably contains a string. If you control the data ingest, you will probably find it simpler to parse the HTML on the way in, and store the title in a different field.

The only server-side option I can think of is a regex, but parsing HTML that way is usually not easy to get right.
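
To make the regex route concrete, here is a minimal sketch rather than a definitive solution. It assumes data.body holds a string, that MongoDB 4.2 or later is available for the $regexFind aggregation operator, and that the script runs in the mongo shell against the analytics database. The names TITLE_RE, extractTitle and titlePipeline are made up for illustration. A regex only copes with reasonably well-formed markup; a <title> inside a comment or script would fool it, for example.

```javascript
// Hypothetical helper: return the text of the first <title> element in an
// HTML string, or null when there is none. [\s\S] also matches newlines.
var TITLE_RE = /<title\b[^>]*>([\s\S]*?)<\/title\s*>/i;

function extractTitle(html) {
    if (typeof html !== 'string') {
        return null; // field missing or not a string
    }
    var match = TITLE_RE.exec(html);
    // Collapse whitespace so a multi-line title comes back on one line.
    return match ? match[1].replace(/\s+/g, ' ').trim() : null;
}

// Same idea as an aggregation pipeline, so the matching happens inside
// mongod and only the titles cross the network. Assumes MongoDB 4.2+.
var titlePipeline = [
    { $limit: 40 },
    { $project: {
        _id: 0,
        found: { $regexFind: {
            input: '$data.body',
            regex: '<title[^>]*>(.*?)</title>',
            options: 'is' // i: ignore case, s: let . match newlines
        } }
    } },
    // captures[0] is the text matched by the first (and only) group.
    { $project: { title: { $arrayElemAt: ['$found.captures', 0] } } }
];

// Only runs inside the mongo shell, where db and printjson exist.
if (typeof db !== 'undefined' && typeof printjson === 'function') {
    db.https.aggregate(titlePipeline).forEach(printjson);
}
```

In the loop from the question, extractTitle(resp.data.body) would replace the document.createElement calls (the projection returns the field under data, so resp.body is undefined). Keep in mind that a script started with mongo test.js runs in the shell process, so that loop still pulls every body over the connection; the aggregation variant is the part that keeps the work on the server.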

edited May 24 '20 at 22:01

answered May 24 '20 at 07:05

