Hi, I am using the following code to remove content from a page:

I am using regex and cannot use jsoup, so please do not provide any jsoup links or code; they would be useless to me here.

<cfset removetitle = rereplacenocase(cfhttp.filecontent, '<title[^>]*>(.+)</title>', "\1")>

Now, in the same way as above, I want to remove the following things:

1. <base href="http://search.google.com">
2. <link rel="stylesheet" href="mystyle.css">
3. There are 5 tables inside the body; I want to remove the 2nd table.

Can anyone guide me on this?

voyeger
  • Take a look at jSoup - http://jsoup.org/. It is really the best tool for HTML/DOM parsing like this. – Scott Stroz Dec 05 '14 at 20:00
  • Scott is right. Why isn't jSoup an option? As to a regex solution, does the second table have an id attribute unique in the document? This is possible with regex but there are problems that regex cannot always reliably solve. – Regular Jo Dec 05 '14 at 20:06
  • Come on guys, I had written that jsoup is not an option for me. I love jsoup, I have gone through its documentation, it's awesome stuff, but I am limited as of now and unable to use it. – voyeger Dec 05 '14 at 20:13
  • As an update: I did the first two like this, hopefully it is the right way: `REReplaceNoCase(remBase, "<\s*link[^>]*>", "","one")` – voyeger Dec 05 '14 at 20:14
  • Well, as I asked before, does the second table have a unique html ID? – Regular Jo Dec 05 '14 at 20:14
  • How do I remove the 2nd table from the 5 table tags? – voyeger Dec 05 '14 at 20:14
  • No, it does not have that. – voyeger Dec 05 '14 at 20:18
  • Why is JSoup not an option? No regex will work as reliably as jsoup. – Scott Stroz Dec 05 '14 at 20:23
  • My client does not like jsoup. – voyeger Dec 05 '14 at 20:27
  • If your client is not a programmer, they don't even understand what you're talking about. If your client is a programmer, they should endorse doing things the right, and more importantly: reliable, way. – Regular Jo Dec 05 '14 at 20:53
  • Your *client* should not be dictating how you go about solving their issues. Regex is simply not really appropriate for extracting DOM elements. – Adam Cameron Dec 06 '14 at 19:19

1 Answer

Scott is right, and Leigh was right before when you asked a similar question: jSoup is your best option.

As for a regex solution: it is possible, but there are problems that regex cannot always solve. For instance, if the first or second table contains a nested table, the regex below can trip. (Note that text is not required between the tables; I'm just demonstrating that things can be between them.)

(If there is always a nested table, regex can handle it, but if there is only sometimes a nested table, in other words the structure is unknown, it gets a lot messier.)

<cfsavecontent variable="sampledata">
<body>
<table cellpadding="4"></table>stuff
is <table border="5" cellspacing="7"></table>between
<table border="3"></table>the
<table border="2"></table>tables
<table></table>
</body>
</cfsavecontent>

<cfset sampledata = rereplace(sampledata,"(?s)(.*?<table.*?>.*?<\/table>.*?)(<table.*?>.*?<\/table>)(.*)","\1\3","ALL") />
<cfoutput><pre>#htmleditformat(sampledata)#</pre></cfoutput>

What this does:

  • `(?s)` sets `.` to match newlines as well.
  • `(.*?<table.*?>.*?<\/table>.*?)` matches everything before the first table, the first table, and everything between it and the second table, and stores it as capture group 1.
  • `(<table.*?>.*?<\/table>)` matches the second table and stores it as capture group 2.
  • `(.*)` matches everything after the second table and stores it as capture group 3.

The third parameter, `\1\3`, then keeps the first and third capture groups and drops the second.
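
To make the nested-table caveat concrete, here is a minimal sketch that runs the same pattern against input where the second table contains a nested table. The sample markup, the `id` values, and the variable name `nested` are made up for illustration.

```cfm
<!--- Hypothetical input: the second table contains a nested table --->
<cfsavecontent variable="nested">
<body>
<table id="first"></table>
<table id="second"><tr><td><table id="inner"></table></td></tr></table>
<table id="third"></table>
</body>
</cfsavecontent>

<!--- Same pattern as above. Capture group 2 is lazy, so it ends at the first
      closing table tag it meets, which belongs to the inner table --->
<cfset nested = rereplace(nested, "(?s)(.*?<table.*?>.*?<\/table>.*?)(<table.*?>.*?<\/table>)(.*)", "\1\3", "ALL") />
<cfoutput><pre>#htmleditformat(nested)#</pre></cfoutput>
```

Because the lazy match stops at the inner table's closing tag, I would expect the output to keep an orphaned `</td></tr></table>` fragment between the first and third tables, which is exactly the kind of breakage a DOM parser avoids.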

If you have control of the source document, you can create html comments like

<!-- table1 -->
  <table>...</table>
<!-- /table1 -->

And then use that in the regex and end up with a more regex-friendly document.
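
As a rough sketch of that idea, assuming the source document wraps each table in markers named `table1`, `table2`, and so on (the marker names and the variable `marked` are hypothetical), removing the second table becomes a match between two fixed strings:

```cfm
<!--- Hypothetical sample: assumes every table is wrapped in numbered marker comments --->
<cfsavecontent variable="marked">
<body>
<!-- table1 -->
<table cellpadding="4"></table>
<!-- /table1 -->
<!-- table2 -->
<table border="5"><tr><td><table></table></td></tr></table>
<!-- /table2 -->
<!-- table3 -->
<table border="3"></table>
<!-- /table3 -->
</body>
</cfsavecontent>

<!--- (?s) lets . span newlines; the lazy .*? stops at the first closing marker --->
<cfset marked = rereplace(marked, "(?s)<!-- table2 -->.*?<!-- /table2 -->", "", "one") />
<cfoutput><pre>#htmleditformat(marked)#</pre></cfoutput>
```

Since the pattern anchors on the markers rather than on `<table>` tags, a nested table inside the second table should no longer matter, as long as the markers themselves are unique and never nested.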

Still, Scott said it best about not using the proper tool for the task:

That is like telling a carpenter, build me a house, but don't use a hammer.

These tools are created because programmers frequently run into precisely the problem you're having, and so they create a tool, and often freely share it, because it does the job much better.

Regular Jo