17

I have a number of DOM elements being dynamically created on a web page. Their IDs are generated from an external list and sometimes these names may contain illegal characters for an ID like "@" or "&".

I need to remove characters that do not match the following rules:

  • The string must begin with a letter
  • The first character may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".")
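For reference, the two rules above can be written as a single test pattern. This is only a sketch: the function name `isValidId` is made up for illustration, and it only checks a string against the rules without removing anything.

```javascript
// Hypothetical helper: true when a string already satisfies both rules.
function isValidId(id) {
  // [A-Za-z]   the string must begin with a letter
  // [\w:.-]*   then any number of letters, digits, "_", ":", "." or "-"
  //            (\w in JavaScript is shorthand for [A-Za-z0-9_])
  return /^[A-Za-z][\w:.-]*$/.test(id);
}

console.log(isValidId("ofPeoplearenotthe1"));           // true
console.log(isValidId("99% of People are not the 1%")); // false
```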

So, if the original string is:

99% of People are not the 1%

Then the resulting string with illegal characters removed would be:

ofPeoplearenotthe1

Can anyone help me write a JavaScript regex that removes the characters in a string that do not meet the above requirements?

6 Answers

36
var str = "99% of People are not the 1%";
str = str.replace(/^[^a-z]+|[^\w:.-]+/gi, "");
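To make the two halves of that pattern easier to follow, here is the same regex wrapped in a function with each alternative annotated. The function name `stripIllegalIdChars` and the extra sample inputs are illustrative only.

```javascript
// Same pattern as above; the wrapper name is hypothetical.
function stripIllegalIdChars(str) {
  // ^[^a-z]+    a leading run of non-letters (the i flag covers A-Z as well)
  // [^\w:.-]+   any run of characters other than letters, digits, "_", ":", "." or "-"
  return str.replace(/^[^a-z]+|[^\w:.-]+/gi, "");
}

console.log(stripIllegalIdChars("99% of People are not the 1%")); // "ofPeoplearenotthe1"
console.log(stripIllegalIdChars("foo@bar.baz"));                  // "foobar.baz"
console.log(stripIllegalIdChars("12345"));                        // "" (nothing valid is left)
```

Note the last case: an input with no letters at all collapses to an empty string, which is not usable as an ID, so callers may want to check for that.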
Qtax
  • 6
    Note that IDs must also be unique. If you're removing the illegal characters to comply with standards, you will also need to maintain a list of "used" IDs, so that you can avoid collisions. – Matt Mar 09 '12 at 14:35
  • hello...can you provide the C# version of this regular expression please...?? – umair.ali Jan 01 '13 at 07:19
  • 3
    @umair.ali, it would be pretty much the same, and could be quoted like so: `@"(?i:^[^a-z]+|[^\w:.-]+)"` – Qtax Jan 04 '13 at 18:07
  • How do I do this without removing the spaces? Like this: `of People are not the 1` – Squirrl Feb 22 '14 at 05:59
  • 2
    This does not appear to remove periods from the ID. I'm not sure whether periods are invalid per the HTML spec; however, they do prevent jQuery from accessing the element using the ID selector. I ended up using this: `str.replace(/^[^a-z]+|[^\w]+/gi, "")` – JeffryHouser Jun 15 '17 at 13:49
  • @JeffryHouser [periods are allowed in IDs](https://stackoverflow.com/a/79022/107152). To match them in jQuery (or any other CSS selector) you need to escape them with a backslash, e.g. `$("#foo\\.bar")` (the backslash itself needs to be escaped in JS strings, hence two in this case). – Qtax Jun 16 '17 at 08:44
  • @Qtax Thanks, I'll have to experiment with that. I was using a variable, so something like this: `$(foo)` . – JeffryHouser Jun 16 '17 at 14:44
  • 3
    The accepted answer uses the i flag when it is not really needed and may unnecessarily increase the regex run time. A more specific (and thus more efficient) regex would be: `str = str.replace(/^[^a-zA-Z]+|[^\w:.-]+/g, "");` – Nadav Jul 25 '17 at 03:16
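On the uniqueness point raised in the first comment: one way to avoid collisions after stripping characters is to remember which IDs have already been handed out and append a counter when a name repeats. The sketch below is an assumption about how that could look; `createIdRegistry`, `claim` and the `-2`, `-3` suffix scheme are all made up for illustration.

```javascript
// Hypothetical helper: returns a function that hands out IDs unique within this registry.
function createIdRegistry() {
  const used = new Set();
  return function claim(base) {
    let candidate = base;
    let n = 2;
    // Keep appending "-2", "-3", ... until the candidate has not been used yet.
    while (used.has(candidate)) {
      candidate = base + "-" + n;
      n += 1;
    }
    used.add(candidate);
    return candidate;
  };
}

const claimId = createIdRegistry();
console.log(claimId("ofthe1")); // "ofthe1"
console.log(claimId("ofthe1")); // "ofthe1-2"
```

The hyphen and digits used for the suffix are both allowed by the rules in the question, so a valid base stays valid.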
2

The HTML5 specification has been updated and according to https://html.spec.whatwg.org/multipage/dom.html#global-attributes id attributes can now contain literally any character for their value except whitespace.

When specified on HTML elements, the id attribute value must be unique amongst all the IDs in the element's tree and must contain at least one character. The value must not contain any ASCII whitespace.

I'm not sure at what point an element could be assigned two id attributes, or what the reasoning behind that was (perhaps a less mature understanding at the time). Either way, it has been removed from the standard, and that has been common knowledge in the web development community for years now.

John
  • 1
    I think the "uniqueness" mentioned in the spec is not about assigning two IDs to one element, but about the requirement that an ID be unique within the DOM tree, so that it can serve its main purpose: helping to identify and reference elements. In most cases classes are enough for that (and are usually more flexible). But one example where IDs are still needed is connecting form inputs with labels via the label's "for" attribute: `` – mwld May 05 '17 at 09:48
2

If you want something that is resistant to conflicts, try using btoa to convert the text into Base64:

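var badId1 = "99% of the 1%";
var badId2 = "999% of the 1%";
var validId1 = "ID_OTklIG9mIHRoZSAxJQ";
var validId2 = "ID_OTk5JSBvZiB0aGUgMSU";

// Strip only the "=" padding. Slicing off a fixed two characters would also
// discard real data whenever the Base64 output has fewer than two padding characters.
var makeId = function(text) { return "ID_" + btoa(text).replace(/=+$/, ""); };

expect(makeId(badId1)).toEqual(validId1);
expect(makeId(badId2)).toEqual(validId2);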

Notice how the two IDs generate different keys, whereas the regex approach would reduce both to the same string.
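Two caveats are worth keeping in mind with this approach: `btoa` throws on characters outside Latin-1, and the standard Base64 alphabet includes "+" and "/", which the rules in the question do not allow. A possible variant, sketched below under those assumptions, encodes the text as UTF-8 first and switches to the URL-safe alphabet ("-" and "_"). The name `makeSafeId` is hypothetical, and it removes only the "=" padding rather than a fixed number of characters.

```javascript
// Hypothetical variant: UTF-8 encode, Base64 encode, then map to the URL-safe alphabet.
function makeSafeId(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes, so any Unicode input works
  let binary = "";
  for (const b of bytes) {
    binary += String.fromCharCode(b);           // one character per byte, as btoa expects
  }
  return "ID_" + btoa(binary)
    .replace(/\+/g, "-")  // "+" is not allowed by the question's rules
    .replace(/\//g, "_")  // neither is "/"
    .replace(/=+$/, "");  // drop only the trailing padding
}

console.log(makeSafeId("99% of the 1%"));  // "ID_OTklIG9mIHRoZSAxJQ"
console.log(makeSafeId("999% of the 1%")); // "ID_OTk5JSBvZiB0aGUgMSU"
```

The "ID_" prefix guarantees the result starts with a letter, and every remaining character is a letter, digit, hyphen or underscore.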

VeryColdAir
Steve Cooper
1

If anybody needs this in Java:

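    if (!htmlId.matches("^[A-Za-z][\\w:.-]*$")) {
        LOG.warn("html id " + htmlId + " is not valid, have to remove all invalid chars");

        // Strip leading non-letters first, then any other disallowed character.
        // (A second "^" inside the character class would be a literal caret and let "^" through.)
        htmlId = htmlId.replaceAll("^[^A-Za-z]+|[^\\w:.-]+", "");
    }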

In my case I checked the String and replaced all invalid characters with an empty string. Thanks to Qtax.

ziodraw
1
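var id = "99% of People are not the 1%";
// The anchored alternative comes first so that the whole leading run of non-letters is removed in one match.
id = id.replace(/^[^a-z]+|[^a-z0-9\-_:\.]/gi, "");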

Demo: http://jsfiddle.net/jfriend00/qqjh6/

The idea is to remove the run of non-alphabetic characters at the beginning and then remove all other illegal characters from the rest of the string.
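The two steps can also be written as separate passes, which makes the order explicit and avoids depending on how the alternatives in a single regex are tried. This is just a sketch; the function name `sanitizeIdTwoPass` is made up.

```javascript
// Illustrative two-pass version; the function name is hypothetical.
function sanitizeIdTwoPass(raw) {
  // Pass 1: drop everything before the first letter.
  const fromFirstLetter = raw.replace(/^[^a-zA-Z]+/, "");
  // Pass 2: drop any remaining character other than letters, digits, "-", "_", ":" or ".".
  return fromFirstLetter.replace(/[^a-zA-Z0-9\-_:.]/g, "");
}

console.log(sanitizeIdTwoPass("99% of People are not the 1%")); // "ofPeoplearenotthe1"
console.log(sanitizeIdTwoPass("%99 bottles"));                  // "bottles"
```

Because the second pass never removes a letter, the first character left by the first pass stays in place.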

One might ask what is the point of even having an id that is not known ahead of time and is dynamically generated based on content. You can't very well use it in CSS if it's based on some content that can change.

jfriend00
0

As John mentioned, the HTML5 spec allows all characters in IDs except whitespace.

That means the following RegEx (in JavaScript) would be enough to follow the HTML5 spec:

let str = "99% of People are not the 1%";
str = str.replace(/\s+/g, "");
// "99%ofPeoplearenotthe1%"
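One detail to be aware of: `\s` in JavaScript matches more than the five characters the spec calls ASCII whitespace (tab, line feed, form feed, carriage return and space), for example the no-break space. That is harmless for cleaning up IDs, but if you want to mirror the spec wording exactly, something like the following sketch would do it. Both function names are hypothetical.

```javascript
// Hypothetical helpers that follow the spec's definition of ASCII whitespace:
// U+0009 TAB, U+000A LF, U+000C FF, U+000D CR and U+0020 SPACE.
function stripAsciiWhitespace(str) {
  return str.replace(/[\t\n\f\r ]+/g, "");
}

// An HTML5 id must contain at least one character and no ASCII whitespace.
function isValidHtml5Id(id) {
  return id.length > 0 && !/[\t\n\f\r ]/.test(id);
}

console.log(stripAsciiWhitespace("99% of People are not the 1%")); // "99%ofPeoplearenotthe1%"
console.log(isValidHtml5Id("99%ofPeoplearenotthe1%"));             // true
```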
mwld