I have an HTML file like the one below:

      <tr>
        <td>SOMETHING1</td>
        <td>SOMETHING2</td>
        <td>SOMETHING3</td>
      </tr>
      <tr>
        <td>SOMETHING1</td>
        <td>SOMETHING2</td>
        <td>SOMETHING3</td>
      </tr>
      <tr>
        <td>SOMETHING1</td>
        <td>SOMETHING2</td>
        <td>SOMETHING3</td>
      </tr>

    </table>
    <br>
    </div>
    <a href="javascript:;" onmousedown="toggleDiv('20161023');">Sunday 23 ...   </a></h3>
    <br>
    <div class="time_div" id="20161023" style="display:none">
    <p class="dep_parag">Test automation on Sunday 23 October</p>
    <table id="table" border="1" cellpadding="3" cellspacing="0">

    <tr>
        <td>SOMETHING1</td>
        <td>SOMETHING2</td>
        <td>SOMETHING3</td>
      </tr>
      <tr>
        <td>SOMETHING1</td>
        <td>SOMETHING2</td>
        <td>SOMETHING3</td>
      </tr>
      <tr>
        <td>SOMETHING1</td>
        <td>SOMETHING2</td>
        <td>SOMETHING3</td>
      </tr>

As you can see, there is a list of table rows interrupted by a section containing some JavaScript (in the sample above, the section starts with </table> and ends with cellspacing="0">).

This is just an extract of an HTML page containing more than 300,000 table rows!

I have to delete the section with the JavaScript, because I just need one long, clean list of table rows with nothing between them.

Doing it manually would be crazy, so I would like something (a regular expression) to clean the file in one click. (I usually run simple regular expressions in Notepad++, but this one is a little too hard for me.)

I was thinking of something like:

delete all the lines from </table> to cellspacing="0">

or

delete the </table> line and the following 8 lines.

Could someone kindly help me with this regex?

Thanks a lot! :)

Toto
ivoruJavaBoy
  • not sure if I understand correctly; try using the "Regular expression" search mode and replace this regex with an empty string: \r\n(.*\r\n){2}.*javascript.*\r\n(.*\r\n){4} – Skycc Oct 26 '16 at 14:04

3 Answers

Assuming that you are not fussed about irregular whitespace, how about a search pattern of:

</table>.*?<table.*?>

With an empty "Replace with" string, tick the "Regular expression" and ". matches newline" options.
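
For anyone who would rather script this than use the Notepad++ dialog, here is a minimal Python sketch of the same idea. It assumes the whole file fits in memory; the function name strip_between_tables and the shortened sample string are made up for illustration. re.DOTALL plays the role of the ". matches newline" checkbox.

```python
import re

# Same pattern as above. re.DOTALL lets "." match newlines,
# like the ". matches newline" option in Notepad++.
SECTION = re.compile(r"</table>.*?<table.*?>", re.DOTALL)

def strip_between_tables(html: str) -> str:
    """Remove everything from each </table> through the next <table ...> tag."""
    return SECTION.sub("", html)

# Shortened, hypothetical sample modelled on the question's HTML.
sample = "\n".join([
    "<tr><td>A</td></tr>",
    "</table>",
    "<br>",
    "</div>",
    '<a href="javascript:;">Sunday 23 ...</a></h3>',
    "<br>",
    '<div class="time_div" id="20161023" style="display:none">',
    '<p class="dep_parag">Test automation on Sunday 23 October</p>',
    '<table id="table" border="1" cellpadding="3" cellspacing="0">',
    "<tr><td>B</td></tr>",
]) + "\n"

cleaned = strip_between_tables(sample)
print(cleaned)
```

A closing </table> that is not followed by another <table ...> tag (for example the very last one in the file) is left untouched, because the pattern needs both ends to match.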

ardavey

I don't quite understand which part you want to remove (my understanding is from </table> to cellspacing="0">?), but it should be fairly easy. Do you mean something like this?

<a href="javascript([^\n]+\r\n){5}
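
To see what this line-counting pattern actually removes, here is a small Python sketch run on a shortened, made-up sample with Windows (\r\n) line endings. The only change to the pattern is \r?\n instead of \r\n, an adaptation so that it also tolerates plain \n endings. As written, the match starts at the <a href line, so the </table>, <br> and </div> lines before it stay in place.

```python
import re

# The pattern above, with \r?\n (an adaptation) so that both
# \r\n and \n line endings are accepted.
LINK_BLOCK = re.compile(r'<a href="javascript(?:[^\n]+\r?\n){5}')

# Shortened, hypothetical sample with CRLF line endings.
sample = "\r\n".join([
    "</tr>",
    "</table>",
    "<br>",
    "</div>",
    '<a href="javascript:;">Sunday 23 ...</a></h3>',
    "<br>",
    '<div class="time_div" id="20161023" style="display:none">',
    '<p class="dep_parag">Test automation on Sunday 23 October</p>',
    '<table id="table" border="1" cellpadding="3" cellspacing="0">',
    "<tr>",
]) + "\r\n"

result = LINK_BLOCK.sub("", sample)
print(result)
```

The five repetitions consume the rest of the link line plus the next four lines, ending with the <table ...> line. Since [^\n]+ needs at least one character, a completely empty line (with \n-only endings) inside the block would stop the match.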
Ben
  • I know. But removing the 8 lines below the </table> is just as simple. – Ben Oct 26 '16 at 19:23
  • Taking advantage of the table structure, just <\/table>.* => will do the trick (you have to tick the ". matches newline" option). Some NPP version on my PC used to have a bug: replacing with an empty string in regex mode would make NPP crash, so I usually replace with something. But I agree a non-greedy match is the proper way to go. – Ben Oct 26 '16 at 19:37

This regular expression works with the s (single-line) flag in PHP and Python; in Java, compile the pattern with the DOTALL option:

\<\/table\>.+?(?=javascript\:\;).+?(?=\<table.+?cellspacing\=\"0\"\>)\<table.+?cellspacing\=\"0\"\>
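
As a rough check in Python, the sketch below runs the expression with re.DOTALL (Python's name for the s flag) on a shortened, made-up sample, next to the same expression with the backslashes removed. The names ESCAPED, PLAIN and sample are only for illustration. In Python's re module, none of the escaped punctuation characters here is special, so both forms should behave identically.

```python
import re

# The expression exactly as given above.
ESCAPED = re.compile(
    r'\<\/table\>.+?(?=javascript\:\;).+?'
    r'(?=\<table.+?cellspacing\=\"0\"\>)\<table.+?cellspacing\=\"0\"\>',
    re.DOTALL,
)

# The same expression without the backslash escapes.
PLAIN = re.compile(
    r'</table>.+?(?=javascript:;).+?'
    r'(?=<table.+?cellspacing="0">)<table.+?cellspacing="0">',
    re.DOTALL,
)

# Shortened, hypothetical sample modelled on the question's HTML.
sample = "\n".join([
    "<tr><td>A</td></tr>",
    "</table>",
    "<br>",
    "</div>",
    '<a href="javascript:;">Sunday 23 ...</a></h3>',
    "<br>",
    '<div class="time_div" id="20161023" style="display:none">',
    '<p class="dep_parag">Test automation on Sunday 23 October</p>',
    '<table id="table" border="1" cellpadding="3" cellspacing="0">',
    "<tr><td>B</td></tr>",
]) + "\n"

escaped_result = ESCAPED.sub("", sample)
plain_result = PLAIN.sub("", sample)
print(escaped_result)
```

Other regex flavors may treat some escaped punctuation differently, so the equivalence shown here is only claimed for Python's re module.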
Vijay Wilson
  • Why are you escaping all these characters? – Toto Oct 26 '16 at 15:09
  • It is always safe enough to precede a non-alphanumeric character with a \ to indicate that the character stands for itself. http://stackoverflow.com/questions/399078/what-special-characters-must-be-escaped-in-regular-expressions – Vijay Wilson Oct 27 '16 at 04:55