jsfiddle fails on this regex?

Question

I'm prototyping some text processing to prep research data for coding, and I've got a JavaScript replace statement that bombs in jsFiddle and I cannot figure out why:

   mE[1] = mE[1].replace(/<p.*>/ig, ''); // <<< this line

I'm trying to remove any opening paragraph tag.

If you look at http://jsfiddle.net/jotarkon/2e5gq/ and uncomment that line, you'll see that the script fails.

-- click on the Heading to fire the function

This is driving me nuts. Any ideas what's going wrong?

You are attempting to process HTML with regex? Were you aware of the [consequences](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) that this attempt might have? Just making sure before further *damage* is done. — Darin Dimitrov, Nov 18 '11 at 22:06
man, I have no choice. The output is utf-8 txt and it's going into a research coding tool. — Gordon, Nov 18 '11 at 22:08
@Gordon, of course you have a choice. A good choice could be an HTML parser. — Darin Dimitrov, Nov 18 '11 at 22:09
...especially with JavaScript, where there is an advanced DOM parser built right into the environment. — Tomalak, Nov 18 '11 at 22:10
sorry - wrong fork. I updated the link above to: http://jsfiddle.net/jotarkon/2e5gq/ — Gordon, Nov 18 '11 at 22:13
like I said, I have no choice. either I search and replace the html in the research data processing tool, or I do it here in the program that has to scrape the data from discussion forum web pages. Here is better. (yes, that's right, I have to scrape pages to get this). — Gordon, Nov 18 '11 at 22:16
@Gordon read my answer :-) The problem is an illegal character and it has nothing to do with your regex. — Pointy, Nov 18 '11 at 22:18
@Gordon You have JavaScript, running in a browser, with jQuery on top. This is more than you need to solve that problem without regexes, and more elegantly even than you do now. *Of course* you do have a choice. — Tomalak, Nov 18 '11 at 22:18
Darin, what would you suggest? I have 100 pages of student data in HTML form. I need to automate grabbing all those pages, removing some of the more annoying HTML, escaping what remains (to preserve the formatting within), and saving the result as tab-delimited txt. — Gordon, Nov 18 '11 at 22:19
@Gordon Hi Gordon. How are you? Do you have an "Illegal character" error in your code? Really? Well, tell you what. Go to the semicolon on that line, go two characters to the right, then hit backspace twice. Have a nice day! — Pointy, Nov 18 '11 at 22:21
@Pointy Yes there are all kinds of illegal characters in this code. They are concentrated between forward slashes. ;) — Tomalak, Nov 18 '11 at 22:26
Tomalak, using jQuery, how can I remove a paragraph tag without removing its contents, and replace the closing tag with a newline? It seemed easier to do with regex... — Gordon, Nov 18 '11 at 22:28
@Gordon: I'm working on a solution that does just that. Try something like this: http://jsfiddle.net/2e5gq/2/ — Tomalak, Nov 18 '11 at 22:37
sorry Tomalak, I don't know if I'm on a delay here. OK, it looks fine, and I see how I can use jQuery to do that. Thank you! — Gordon, Nov 18 '11 at 23:04

score 2 · Answer 1 · answered Nov 18 '11 at 22:10


First of all, don't use regexen for HTML. There are libraries available for that. You can't parse HTML with regexen. Second, you need to be more specific. Saying "a replace statement the bombs" tells us nothing about the nature of the error. Finally, in case you're curious, that regex is greedy, so it will indiscriminately replace everything from the first HTML tag that starts with the letter p up to the very last > on the same line (in JavaScript, . does not match line breaks). If you really want to use that, make it non-greedy and make sure it doesn't match other tags that start with the letter p. I'm not going to be specific because doing that is the Wrong Answer.
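For readers who want to see the greedy behaviour described above, here is a minimal sketch. The sample string is made up for illustration and is not taken from the fiddle, and the narrower pattern is only one possible tightening, not a recommendation to process HTML this way: it can still misfire on attribute values that contain `>`.

```javascript
// Hypothetical sample input; not taken from the original fiddle.
const html = '<p class="post">First</p><pre>code</pre><p>Second</p>';

// Greedy: `.*` runs to the LAST `>` on the line, so the whole line is removed.
const greedy = html.replace(/<p.*>/ig, '');

// Lazy: `.*?` stops at the first `>`, but `<p` still matches the start of `<pre>`.
const lazy = html.replace(/<p.*?>/ig, '');

// Narrower: `\b` requires the tag name to end right after "p", and `[^>]*`
// cannot run past the tag's own closing `>`.
const narrow = html.replace(/<p\b[^>]*>/ig, '');

console.log(JSON.stringify(greedy));
console.log(JSON.stringify(lazy));
console.log(JSON.stringify(narrow));
```

Comparing the three printed strings shows why "make it non-greedy" alone is not enough: the lazy version stops eating the rest of the line but still strips the opening `<pre>` tag.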


Dan


He's got non-greedy operators in several places. – Pointy Nov 18 '11 at 22:14
thanks! I'm still learning about regex. (I think I described the nature of the error.) What are the libraries you're talking about? I need to take web pages and spit out plain text, with tab chars inserted (according to the html used) to create a data file for coding and statistical processing. (I don't think there is a library that is going to be easy enough to learn given the time I have to devote to this.) – Gordon Nov 18 '11 at 22:42
@Gordon: There are several libraries that are pretty easy to learn. Regex won't save the day for you, and any time you invest into learning regex for the sole purpose of HTML parsing is wasted and better spent learning how to use a DOM parser. – Tomalak Nov 18 '11 at 22:47
like what library? got any links? This is kind of a one off deal. – Gordon Nov 18 '11 at 23:15

Pointy · Accepted Answer · 2011-11-18T22:28:11.357

2

The problem appears to be an actual illegal character somewhere in that line, and I don't think it has anything to do with the regex. Try typing the whole line in from scratch and delete that one. When I do that, the fiddle works fine (well, it doesn't get that error at least).

edit — the illegal character is right after the semicolon on that line. Starting from the "//" on your "this line" comment, hit backspace a few times to erase the bogus character and the semicolon, then re-type the semicolon.

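edit some more - The characters are the byte sequence C2 AD (hex), which is the UTF-8 encoding of U+00AD, a soft hyphen. It is invisible in most editors, which is why the line looks fine.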
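If retyping the line is not practical, a small scan can locate invisible characters like this one. The sketch below is an assumption-laden illustration: `findSuspectChars` is a hypothetical helper name, the list of characters it looks for is a hand-picked set of common copy-and-paste culprits rather than an exhaustive one, and the sample string rebuilds the offending line from the description above (soft hyphen placed right after the semicolon).

```javascript
// Hypothetical helper: report invisible or easily overlooked characters
// together with their position in the source text.
function findSuspectChars(source) {
  // No-break space, soft hyphen, zero-width space/joiners, word joiner, BOM.
  const suspect = /[\u00A0\u00AD\u200B-\u200D\u2060\uFEFF]/g;
  const hits = [];
  source.split('\n').forEach((line, i) => {
    for (const m of line.matchAll(suspect)) {
      const hex = m[0].codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
      hits.push({ line: i + 1, column: m.index + 1, codePoint: 'U+' + hex });
    }
  });
  return hits;
}

// Assumed reconstruction of the failing line: U+00AD directly after the semicolon.
const sample = "mE[1] = mE[1].replace(/<p.*>/ig, '');\u00AD // <<< this line";
console.log(findSuspectChars(sample));
```

Pasting the contents of the script into a string and running it through a helper like this reports the line and column to fix, which is quicker than hunting for the character by eye.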

edited Nov 18 '11 at 22:28

answered Nov 18 '11 at 22:12


yeah! too deep! OK. That problem solved. Now how to remove just an opening part of an html tag. – Gordon Nov 18 '11 at 22:27
Wow, what just happened here? Sure, you found why the JS code wasn't running, but... seriously. – Tomalak Nov 18 '11 at 22:48
The error had to do with an actual "illegal character" in the source code, probably a result of an errant cut-and-paste. – Pointy Nov 19 '11 at 03:11
Yes, I am aware of that. The *real* problem was the truly horrible way to solve a simple problem, not the stray copy/paste glitch. – Tomalak Nov 19 '11 at 07:18
