
I am working on a plugin that parses HTML files. I use a naming convention like this:

<!--$include="a.html" -->

or

<!--$include="a.html"-->

Both forms should be treated the same.

Following this pattern (similar to server-side includes), I want to search an HTML file. The question is:

How can I find each occurrence of the pattern and get its value (a.html in my example, but it varies)?

It should work like this:

while (!finishedWholeFile) {
    fileName = findPatternFunc(htmlFile)
    replaceFunc(fileName, something)
}

PS: I don't know whether it is better to use a regex in Java or to implement this differently (for example with .indexOf()). If regex performs well enough in this situation, I want to use it.

Any ideas?

kamaci
  • Regular expressions don't perform replacement. They define search patterns. You have to do the replacing yourself. And of course once you've found what you want to replace you don't need another RE to define it. Not a real question. – user207421 Dec 30 '12 at 19:28
  • @EJP I have added a pseudo code to my question. – kamaci Dec 30 '12 at 19:38
  • You haven't added anything that changes the truth of my comment. You don't need two REs. – user207421 Dec 30 '12 at 19:44
  • @EJP I have removed the replacing part and improved the question. – kamaci Dec 30 '12 at 19:50

3 Answers

0

You mean like this?

<!--\$include=\"(?<htmlName>[a-z_-]*)\.html\"\s?-->
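As a minimal sketch of how a named-group pattern like this could be used from Java (7 or later, since named groups are required): the class name `IncludeFinder` and the method `findIncludes` are hypothetical, the dot before `html` is escaped, and the hyphen is placed last in the character class so that it is taken literally. Note that this character class only accepts lowercase letters, hyphens and underscores in the file name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncludeFinder {
    // Matches <!--$include="name.html"--> with an optional whitespace before -->.
    private static final Pattern INCLUDE = Pattern.compile(
            "<!--\\$include=\"(?<htmlName>[a-z_-]*)\\.html\"\\s?-->");

    // Returns the captured base names (without ".html") in document order.
    static List<String> findIncludes(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = INCLUDE.matcher(html);
        while (m.find()) {              // find() advances to the next match
            names.add(m.group("htmlName"));
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<p>x</p><!--$include=\"a.html\" --><!--$include=\"top-nav.html\"-->";
        System.out.println(findIncludes(html)); // prints [a, top-nav]
    }
}
```

The `while (m.find())` loop plays the role of the "until the whole file is finished" loop in the question.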
Muqito
0

Read the file into a string, then

str = str.replaceAll("(?<=<!--\\$include=\")[^\"]+(?=\" ?-->)", something);

will replace the filenames with the string something, then the string can be written back to the file.
(Note: this replaces any text inside the double quotes, not just valid filenames.)

If you only want to replace filenames with the .html extension, swap the [^\"]+ for [^.]+\\.html (the dot is escaped so that it matches a literal dot).
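The two variants can be sketched as follows. The class `IncludeRenamer` and its method names are hypothetical. One addition that is not in the answer: the replacement is passed through `Matcher.quoteReplacement`, because `replaceAll` treats `$` and `\` in the replacement string specially. The second pattern also excludes `"` from the name part, a small assumption so that a match cannot run past the closing quote.

```java
import java.util.regex.Matcher;

class IncludeRenamer {
    // Replaces whatever is between the quotes of each include directive.
    static String replaceNames(String html, String something) {
        return html.replaceAll(
                "(?<=<!--\\$include=\")[^\"]+(?=\" ?-->)",
                Matcher.quoteReplacement(something));
    }

    // Same, but only when the quoted value ends in ".html".
    static String replaceHtmlNames(String html, String something) {
        return html.replaceAll(
                "(?<=<!--\\$include=\")[^.\"]+\\.html(?=\" ?-->)",
                Matcher.quoteReplacement(something));
    }

    public static void main(String[] args) {
        String html = "<!--$include=\"a.html\" -->";
        System.out.println(replaceNames(html, "b.html")); // prints <!--$include="b.html" -->
    }
}
```

Because the surrounding text is matched with lookbehind and lookahead, only the filename itself is consumed and replaced.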

Using regex for this task is fine performance-wise, but see e.g. "How to use regular expressions to parse HTML in Java?" and "Java Regex performance".

MikeM
  • quote from your links: "Using regular expressions to pull values from HTML is always a mistake." and: "Hint: Don't use regexes for link extraction or other HTML "parsing" tasks!" :) – linski Dec 31 '12 at 12:15
  • @linski. Yes, I included the links because I wanted kamaci to consider such opinions before _making up his own mind_. – MikeM Dec 31 '12 at 13:58
  • I thought it might be more visible; now that I have read it again it seems more obvious. – linski Dec 31 '12 at 14:03
0

I used this pattern:

"<!--\\$include=\"(.+)(.)(html|htm)\"-->"
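A few caveats about the pattern above, with a hedged alternative. The greedy `(.+)` can span two directives that sit on the same line, the unescaped `(.)` matches any character rather than only a dot, and the variant with a space before `-->` from the question is not accepted. The sketch below tightens these points and performs the find-and-replace loop with `Matcher.appendReplacement`. The class `IncludeExpander`, the method `expand`, and the `fragments` map (file name to replacement text) are hypothetical stand-ins for however the plugin obtains the included content.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncludeExpander {
    // [^\"]+ cannot cross the closing quote; \\. is a literal dot; html? covers .html and .htm.
    private static final Pattern INCLUDE = Pattern.compile(
            "<!--\\$include=\"([^\"]+\\.html?)\"\\s?-->");

    // Replaces each directive with the text mapped to its file name.
    static String expand(String html, Map<String, String> fragments) {
        Matcher m = INCLUDE.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String fileName = m.group(1);
            String body = fragments.get(fileName);
            // Assumption: directives with no known fragment are left untouched.
            String replacement = (body != null) ? body : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy the text after the last match
        return out.toString();
    }
}
```

For example, with `a.html` mapped to `<b>A</b>`, the input `<!--$include="a.html" --> x` would become `<b>A</b> x`.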
kamaci