5

I need to get only the text content from an HTML string, with a space or a line break separating the text content of different elements.

For example, the HTML string might be:

<ul>
  <li>First</li>
  <li>Second</li>
</ul>

What I want:

First Second

or

First
Second

I've tried wrapping the entire string in a div and reading its textContent using third-party libraries. However, that puts no space or line break between the text content of different elements, which is specifically what I need (i.e. I get FirstSecond, which is not what I want).

The only solution I can think of right now is to build a DOM tree and recursively visit the nodes that contain text, appending each node's text to a string with spaces in between. Is there a cleaner, simpler solution than this?

Heretic Monkey
thesamiroli
  • 3
    You can use the package [cheerio](https://github.com/cheeriojs/cheerio) to do these sorts of things, it is built for scraping/navigating/selecting HTML content. – Jon Church Mar 03 '20 at 20:12

5 Answers

4

Convert HTML to Plain Text:

In your terminal, install the html-to-text npm package:

npm install html-to-text

Then, in JavaScript:

const { convert } = require('html-to-text'); // Import the library

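const htmlString = `
<ul>
  <li>First</li>
  <li>Second</li>
</ul>
`;

const text = convert(htmlString, { wordwrap: 130 });
console.log(text);
// Prints each list item on its own line.
// Note: by default, html-to-text prefixes unordered list items with " * ".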
  • Hope this helps!
Ramy Hadid
1

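You can try stripping the HTML tags with a regular expression. For your example, try the following. Note that the line breaks in the result come from the input string itself, so this only works when the elements are already on separate lines: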

let str = `<ul>
<li>First</li>
<li>Second</li>
</ul>`

console.log(str)

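// Matches opening and closing <li> and <ul> tags, including any attributes
const re = /<\/?(li|ul)[^>]*>/g;

// Remove the tags, then trim the leading and trailing line breaks
str = str.replace(re, '').trim();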
console.log(str)
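
To separate the text even when the markup is on a single line, one option is to replace each tag with a space and then collapse the whitespace. The following is a minimal sketch, not a robust HTML parser: stripTags is a hypothetical helper name, and the approach assumes simple markup (it does not decode entities such as &amp;amp;, and it keeps the contents of script and style elements).

```javascript
// Hypothetical helper: replaces every tag with a space, then collapses whitespace.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, ' ') // each tag becomes a single space
    .replace(/\s+/g, ' ')     // collapse runs of whitespace into one space
    .trim();                  // drop leading and trailing whitespace
}

console.log(stripTags('<ul><li>First</li><li>Second</li></ul>')); // First Second
```

For anything beyond simple, trusted markup, a real HTML parser is likely the safer choice.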
elvira.genkel
1

In a browser, you could use Node.textContent. Node.js, however, has no native DOM and therefore no textContent, so you need an external package. You can install request and cheerio using npm. cheerio, suggested by Jon Church, is probably the easiest web scraping tool to use (there are also more complex ones, such as jsdom). With request and cheerio, you could write:

const request = require("request");
const cheerio = require("cheerio");
const fs = require("fs");

//taken from https://stackoverflow.com/a/19709846/10713877
function is_absolute(url)
{
    var r = new RegExp('^(?:[a-z]+:)?//', 'i');
    return r.test(url);
}

function is_local(url)
{
    var r = new RegExp('^(?:file:)?//', 'i');
    return (r.test(url) || !is_absolute(url));
}

function send_request(URL)
    {
        if(is_local(URL))
        {
            // Strip the file:// scheme, if present, to get a plain filesystem path
            let url_tmp;
            if(URL.slice(0,7)==="file://")
                url_tmp = URL.slice(7);
            else
                url_tmp = URL;

           //taken from https://stackoverflow.com/a/20665078/10713877
           const $ = cheerio.load(fs.readFileSync(url_tmp));
           //Do something
           console.log($.text())
        }
        else
        {
            var options = {
                url: URL,
                headers: {
                  'User-Agent': 'Your-User-Agent'
                }
              };

            request(options, function(error, response, html) {
                //no error
                if(!error && response.statusCode == 200)
                {
                    console.log("Success");

                    const $ = cheerio.load(html);


                    return Promise.resolve().then(()=> {
                        //Do something
                        console.log($.text())
                    });
                }
                else
                {
                    console.log(`Failure: ${error}`);
                }
            });
        }
    }

Let me explain the code. You pass a URL to the send_request function, which checks whether the string refers to a local file (a relative path, an absolute path, or a path starting with file://). If it does, the file is read and passed straight to cheerio; otherwise the page is first fetched with the request module and the response body is passed to cheerio. is_absolute and is_local use regular expressions to make that distinction. The text is extracted with the text() method provided by cheerio. Under the //Do something comments you can do whatever you want with the text. Replace 'Your-User-Agent' with your own user agent string; there are websites that will show it to you.
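
To see how the two regular expressions classify their input, here is a small standalone check. It restates is_absolute and is_local with regex literals, which should be equivalent to the RegExp constructors above; the sample URLs are made up for illustration. One quirk worth noting: a protocol-relative URL such as //cdn.example.com/x is treated as local, because it starts with //.

```javascript
// Same patterns as in the answer, written as regex literals.
function is_absolute(url) {
  return /^(?:[a-z]+:)?\/\//i.test(url);
}

function is_local(url) {
  return /^(?:file:)?\/\//i.test(url) || !is_absolute(url);
}

// Illustrative inputs only; none of these are fetched or read.
const samples = [
  'https://stackoverflow.com/', // remote
  'file:///tmp/index.html',     // local, file:// scheme
  '/tmp/index.html',            // local, absolute path
  'pages/index.html',           // local, relative path
  '//cdn.example.com/x',        // protocol-relative, classified as local
];

for (const url of samples) {
  console.log(url, '->', is_local(url) ? 'local' : 'remote');
}
```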

The following calls will work:

//your local file
send_request("/absolute/path/to/your/local/index.html"); 
send_request("relative/path/to/your/local/index.html");
send_request("file:///absolute/path/to/your/local/index.html"); 
//website
send_request("https://stackoverflow.com/"); 

EDIT: I am on a Linux system.

Hakan Demir
  • 1
    This is helpful but there's a LOT more in here than what is needed to answer the question. If you want to make this answer more useful I'd recommend stripping out everything about using fs/request in node and answer how to turn an html string into the inner text of the nodes. – Gabe O'Leary Nov 19 '20 at 22:53
1

You can try this example, which may help you. It uses the jsdom module:

https://www.npmjs.com/package/jsdom

const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
console.log(dom.window.document.querySelector("p").textContent); 

This code can help I think :)
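
Reading textContent on a container concatenates the text of its children without separators, so for the list in the question it would yield FirstSecond. One way around that is the recursive walk the question describes: visit every node and collect the text nodes individually. The sketch below is an illustration under assumptions: collectText is a hypothetical helper, and plain objects stand in for DOM nodes so that it runs without jsdom. With jsdom you would presumably pass something like dom.window.document.body instead.

```javascript
const TEXT_NODE = 3; // value of Node.TEXT_NODE in the DOM standard

// Hypothetical helper: recursively collects trimmed, non-empty text nodes.
function collectText(node, parts = []) {
  if (node.nodeType === TEXT_NODE) {
    const text = node.nodeValue.trim();
    if (text) parts.push(text);
  } else {
    for (const child of node.childNodes || []) {
      collectText(child, parts);
    }
  }
  return parts;
}

// Plain objects standing in for DOM nodes (assumption: real nodes expose
// the same nodeType, nodeValue and iterable childNodes properties).
const el = (...childNodes) => ({ nodeType: 1, childNodes });
const txt = (nodeValue) => ({ nodeType: 3, nodeValue, childNodes: [] });
const tree = el(txt('\n  '), el(txt('First')), txt('\n  '), el(txt('Second')), txt('\n'));

console.log(collectText(tree).join(' ')); // First Second
```

Joining with '\n' instead of ' ' gives one item per line.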

Dupinder Singh
0

You can try the npm library htmlparser2. It makes this very simple:

const htmlparser2 = require('htmlparser2');

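const htmlString = '<ul><li>First</li><li>Second</li></ul>'; // replace with your HTML string
const parts = []; // text of each non-empty text node, in document order

const parser = new htmlparser2.Parser({
    ontext(text) {
      // Skip whitespace-only text (e.g. indentation between tags)
      if (text && text.trim().length > 0) {
        parts.push(text.trim());
      }
    }
  });

parser.write(htmlString);
parser.end();

// Join with ' ' for a single line, or with '\n' for one item per line
console.log(parts.join(' '));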
vikash vik