7

I have raw HTML in which various tags carry CSS classes and other attributes.

Example:

Input:

<p class="opener" itemprop="description">Lorem ipsum dolor sit amet, consectetur adipisicing elit. Neque molestias natus iste labore a accusamus dolorum vel.</p>

and I would like to get just plain HTML, like:

Output:

<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Neque molestias natus iste labore a accusamus dolorum vel.</p>

I do not know the names of these classes in advance. I need to do this in JavaScript (Node.js).

Any ideas?

David Thomas
Pavel Binar
  • why does the HTML have these classes - is it generated from a CMS or similar, and if not, can it be removed from the source? – AlexHighHigh Jan 08 '14 at 18:15
  • I'd like to suggest you change your title to "How do I remove all attributes from an HTML tag?" as it actually seems to have nothing to do with "css references", whatever those are. – user229044 Jan 08 '14 at 18:17
  • In the example `itemprop="description"` is not a CSS attribute, but an HTML element property. I would guess you want to look for an HTML parser of some kind instead, as attributes are sometimes necessary for HTML elements (such as `` and ``). – Radley Sustaire Jan 08 '14 at 18:18
  • You need an HTML parser that turns that string into nodes and parses it, something like [**cheerio**](https://github.com/MatthewMueller/cheerio) – adeneo Jan 08 '14 at 18:19
  • `itemprop` is not a css class. Do you simply want to remove all attributes? Also, some classes might not only be used for CSS – Bergi Jan 08 '14 at 19:35
  • @AlexHighHigh The HTML is scraped from an already styled website by a Node.js scraper using cheerio – Pavel Binar Jan 08 '14 at 19:57
  • @Bergi, RadGH Yes, I want to remove all attributes, sorry for the poor description – Pavel Binar Jan 08 '14 at 19:59
  • @adeneo any idea how to accomplish that with cheerio? – Pavel Binar Jan 08 '14 at 20:00
  • @Pavel - Sure give me a second to post an answer – adeneo Jan 08 '14 at 20:18

10 Answers

15

This can be done with Cheerio, as I noted in the comments.
To remove all attributes on all elements, you'd do:

var cheerio = require('cheerio');

var html = '<p class="opener" itemprop="description">Lorem ipsum dolor sit amet, consectetur adipisicing elit. Neque molestias natus iste labore a accusamus dolorum vel.</p>';

// load the HTML (recent Cheerio versions may wrap a fragment in <html>/<body>;
// cheerio.load(html, null, false) should avoid that)
var $ = cheerio.load(html);

$('*').each(function() {      // iterate over all elements
    this.attribs = {};     // remove all attributes
});

var result = $.html();        // get the HTML back
adeneo
5

I would create a new element, using the tag name and the innerHTML of that element. You can then replace the old element with the new one, or do whatever you like with the newEl as in the code below:

// Get the current element
var el = document.getElementsByTagName('p')[0];

// Create a new element (in this case, a <p> tag)
var newEl = document.createElement(el.nodeName);

// Assign the new element the contents of the old tag
newEl.innerHTML = el.innerHTML;

// Replace the old element with newEl, or do whatever you like with it
el.parentNode.replaceChild(newEl, el);
MattDiamant
1

Here is another solution to this problem in vanilla JS:

html = html.replace(/\s*\S*="[^"]*"\s*/gm, "");

The script removes all double-quoted attributes from a string named html using a simple regular expression. Because replace() returns a new string, the result has to be assigned back.
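
The one-liner above only handles double-quoted values and will also touch body text that merely looks like an attribute. As a rough sketch of a somewhat stricter regex-only variant (stripAttributes is a hypothetical helper name, not part of any library), the pattern below only matches opening tags and also accepts single-quoted, unquoted, and valueless attributes. It is still just a regular expression, so a real parser such as Cheerio is probably the safer choice for unusual markup (for example, a `<` inside a script body):

```javascript
// Hypothetical helper: strips every attribute from opening tags in an HTML string.
// Assumes reasonably well-formed markup; this is not a full HTML parser.
function stripAttributes(html) {
  // "<", a tag name, then any run of quoted strings or other non-">" characters, then ">".
  // Quoted values are consumed whole, so a ">" inside a value does not end the tag early.
  const openingTag = /<([a-zA-Z][\w:-]*)(?:"[^"]*"|'[^']*'|[^'">])*>/g;

  return html.replace(openingTag, function (match, tagName) {
    // Keep the self-closing form (<br />) if the original tag used it.
    return match.endsWith('/>') ? '<' + tagName + ' />' : '<' + tagName + '>';
  });
}

// Example usage with the markup from the question (shortened):
const input = '<p class="opener" itemprop="description">Lorem ipsum</p>';
console.log(stripAttributes(input)); // prints: <p>Lorem ipsum</p>
```

Closing tags and comments are left alone because the pattern requires a letter directly after the `<`.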

0

Perhaps a regex in JS could pluck out those CSS classes and then output the stripped-down version? That's if I'm understanding your question correctly.

hammerfestus
0

Maybe just using Notepad++ and a quick "Find/Replace" with a blank replacement would be the fastest way, instead of reaching for a parser or something similar.

Alberto Montellano
0

Improvise on this:

$('.some_div').each(function () {
    // removeClass() with no arguments removes every class from the element
    $(this).removeClass();
});
  • No need to do it on the server side. Do it on the client side on some event, such as after you have loaded/changed the data in that container. Bind the event to body. – Aditya Shedge Jan 08 '14 at 18:49
  • But the question is specifically tagged node.js, and why do you assume it's even sent to a browser – adeneo Jan 08 '14 at 19:01
  • You may do that with cheerio, a jQuery API implementation for Node.js. Good hint, thanks! But I do not know '.some_div' – Pavel Binar Jan 08 '14 at 20:13
0

In Python, you could do something like this, but pass in a list of files and tags instead of the hard-coded ones and wrap it in a for loop:

#!/usr/bin/env python
# encoding: utf-8
import re

# Open the file containing the HTML; the with block closes it afterwards
with open('fileWithHtml', 'r') as f:
    for line in f:
        # Replace an opening <p ...> tag and its attributes with a bare <p>
        line = re.sub(r'<p\s[^>]*>', '<p>', line)
        print(line)

This can most probably be translated into JavaScript for Node.js quite easily.

The D Merged
0

You could parse the elements dynamically using a DOM (or SAX, depending on what you want to do) parser and remove every style attribute encountered.

In JavaScript, you could use the HTML DOM removeAttribute() method.

<script>
  function myFunction()
  {
    document.getElementsByClassName("your div class")[0].removeAttribute("style");
  }
</script>
Nick Louloudakis
0

I'm providing the client-side (browser) version, as this question came up when I googled "remove HTML attributes":

// grab the element you want to modify
var el = document.querySelector('p');

// get its attributes and cast to array, then loop through
Array.prototype.slice.call(el.attributes).forEach(function(attr) {

    // remove each attribute
    el.removeAttribute(attr.name);
});

As a function:

function removeAttributes(el) {

    // get its attributes and cast to array, then loop through
    Array.prototype.slice.call(el.attributes).forEach(function(attr) {

        // remove each attribute
        el.removeAttribute(attr.name);
    });
}
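
The function above clears a single element only. If the descendants should be stripped as well, a recursive variant could look like the sketch below. removeAttributesDeep is a hypothetical name, and the code assumes a browser-like DOM where every element exposes attributes, children, and removeAttribute():

```javascript
// Hypothetical helper: removes all attributes from an element and its descendants.
function removeAttributesDeep(root) {
    // Copy the attribute list first, because removing attributes changes it while we loop.
    Array.from(root.attributes).forEach(function (attr) {
        root.removeAttribute(attr.name);
    });

    // Recurse into child elements (text nodes are not part of children).
    Array.from(root.children).forEach(function (child) {
        removeAttributesDeep(child);
    });
}

// In a browser this could be called as, for example:
// removeAttributesDeep(document.querySelector('p'));
```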
John Doherty
0
const cheerio = require("cheerio");

// htmlAsString is the HTML string to clean
const $ = cheerio.load(htmlAsString);

const result = $("*")
 // specify each attribute to remove, "*" as wildcard does not work
.removeAttr("class")
.removeAttr("itemprop")
.html();
// if you also wanted to remove the inner text for some reason, comment out the previous .html() and use
//.text("")
//.toString();

console.log("result", result);
tno2007