I have a very large HTML file containing the results of a security scan, and I need to strip the useless rows out of the document. An example of what I need to remove looks like this:

<tr>
<td width="20%" valign="top" class="classcell0"><span class="classtext" style="color: #ffffff; font-weight: bold !important;">Info</span></td>
<td width="10%" valign="top" class="classcell"> <a href="http://www.nessus.org/plugins/index.php?view=single&amp;id=10395" target="_blank"> 10395</a>
</td>
<td width="70%" valign="top" class="classcell"><span class="classtext" style="color: #263645; font-weight: normal;">Microsoft Windows SMB Shares Enumeration</span></td>
</tr>

After the edit, the text above should simply be gone. I can't use a plain find-and-replace, though, because the snippets vary. Here is another example of what needs to be removed from the document:

<tr>
<td width="20%" valign="top" class="classcell0"><span class="classtext" style="color: #ffffff; font-weight: bold !important;">Info</span></td>
<td width="10%" valign="top" class="classcell"> <a href="http://www.nessus.org/plugins/index.php?view=single&amp;id=11219" target="_blank"> 11219</a>
</td>
<td width="70%" valign="top" class="classcell"><span class="classtext" style="color: #263645; font-weight: normal;">Nessus SYN scanner</span></td>
</tr>

I need to treat the ID number (10395 in the first example) as a variable, although its length stays the same. The description ("Microsoft Windows SMB Shares Enumeration" in the first example) also needs to be treated as a variable, since it changes throughout the document.

I have tried throwing something like this into replace, but I think I am totally missing the mark.

<td width="10%" valign="top" class="classcell"> <a href="http://www.nessus.org/plugins/index.php?view=single&amp;id=\1\1\1\1\1" target="_blank"> \1\1\1\1\1</a>

Maybe I should be using a different tool altogether?

creigel
  • What are you trying to transform to what? What should the doc look like after the change? (And is this a line-by-line match and replace?) – Tezra Jun 16 '17 at 17:07
  • @Tezra I am just trying to remove those snippets, so just replacing them with a space or a \n. It is 6 total lines at a time that would need to be replaced if I approach it the way I am currently thinking. – creigel Jun 16 '17 at 17:09
  • So you want to remove the display text portion? Can you please add the example of what it should look like after to the question? – Tezra Jun 16 '17 at 17:12

2 Answers

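Here are three regexes, ordered from least to most specific. Each one matches the <a>...</a> link on a single line rather than the whole six-line <tr> block, and the looser ones will also match unrelated links that happen to contain digits: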

<a.*>.*\d.*</a>

<a.*>.*\d{5}.*</a>

<a.*id=\d{5}.*>.*\d{5}.*</a>

Disclaimer: be careful. HTML can't be parsed reliably with regex.
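
As a rough illustration of how the three patterns differ, here is a minimal Python sketch. Python itself and the second sample link (other_link) are assumptions made for the example; only the first sample comes from the question. It checks each pattern against a Nessus plugin link and against an unrelated link that happens to contain a digit.

```python
import re

# The three patterns from this answer, least to most specific.
patterns = [
    r'<a.*>.*\d.*</a>',
    r'<a.*>.*\d{5}.*</a>',
    r'<a.*id=\d{5}.*>.*\d{5}.*</a>',
]

# Link taken from the question.
nessus_link = (
    '<a href="http://www.nessus.org/plugins/index.php?view=single&amp;id=10395"'
    ' target="_blank"> 10395</a>'
)
# Hypothetical unrelated link, invented for this example.
other_link = '<a href="page2.html">Page 2</a>'


def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None


# One (nessus_link, other_link) pair of booleans per pattern.
results = [(matches(p, nessus_link), matches(p, other_link)) for p in patterns]
for pattern, result in zip(patterns, results):
    print(pattern, result)
```

In this sketch the first pattern also matches the unrelated link, while the second and third match only the plugin link. None of them covers the surrounding table row.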

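I assume that by repeating \1 multiple times you meant a placeholder for a single character, but that's not what it does: \1 is a backreference that matches whatever the first capturing group matched. What you are trying to achieve is something like this, where (\d+) captures the ID and \1 requires the same digits in the link text (the dots and the question mark in the URL are escaped so they match literally):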

<td width="10%" valign="top" class="classcell"> <a href="http://www\.nessus\.org/plugins/index\.php\?view=single&amp;id=(\d+)" target="_blank"> \1</a>
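
To make the capture-and-backreference idea concrete, here is a small Python sketch. Python is an assumption (the question does not say which tool is in use), and the different_id sample is invented for the example. The sketch keeps only the link part of the pattern and escapes '.' and '?' in the URL so that they are matched literally.

```python
import re

# Group 1 captures the ID in the href; \1 then requires the same digits
# to appear again as the link text.
pattern = re.compile(
    r'<a href="http://www\.nessus\.org/plugins/index\.php\?view=single&amp;id=(\d+)"'
    r' target="_blank"> \1</a>'
)

# Link from the question: the ID in the URL and the link text agree.
same_id = (
    '<a href="http://www.nessus.org/plugins/index.php?view=single&amp;id=10395"'
    ' target="_blank"> 10395</a>'
)
# Hypothetical link whose text does not repeat the ID from the URL.
different_id = (
    '<a href="http://www.nessus.org/plugins/index.php?view=single&amp;id=10395"'
    ' target="_blank"> 11219</a>'
)

m = pattern.search(same_id)
captured = m.group(1) if m else None      # the captured ID as a string
mismatch = pattern.search(different_id)   # None: \1 does not match "11219"
```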

To match the whole six-line block, use \s* to absorb the line breaks between tags:

<tr>\s*<td width="20%" valign="top" class="classcell0"><span class="classtext" style="color: #ffffff; font-weight: bold !important;">Info</span></td>\s*<td width="10%" valign="top" class="classcell"> <a href="http://www\.nessus\.org/plugins/index\.php\?view=single&amp;id=(\d+)" target="_blank"> \1</a>\s*</td>\s*<td width="70%" valign="top" class="classcell"><span class="classtext" style="color: #263645; font-weight: normal;">.*?</span></td>\s*</tr>

Then you can replace every match with an empty string.
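
If the editor's replace dialog is awkward to use on a file this size, the same replacement could be scripted. The following is only a sketch under assumptions: Python as the tool, a small inline sample in place of the real scan report, and a made-up "row to keep" line to show that other rows are left alone.

```python
import re

# The six-line pattern from above, split across string literals for readability.
INFO_ROW = re.compile(
    r'<tr>\s*'
    r'<td width="20%" valign="top" class="classcell0"><span class="classtext" '
    r'style="color: #ffffff; font-weight: bold !important;">Info</span></td>\s*'
    r'<td width="10%" valign="top" class="classcell"> '
    r'<a href="http://www\.nessus\.org/plugins/index\.php\?view=single&amp;id=(\d+)" '
    r'target="_blank"> \1</a>\s*</td>\s*'
    r'<td width="70%" valign="top" class="classcell"><span class="classtext" '
    r'style="color: #263645; font-weight: normal;">.*?</span></td>\s*'
    r'</tr>'
)

# Sample input: one "Info" row from the question plus a hypothetical row to keep.
sample = """<table>
<tr>
<td width="20%" valign="top" class="classcell0"><span class="classtext" style="color: #ffffff; font-weight: bold !important;">Info</span></td>
<td width="10%" valign="top" class="classcell"> <a href="http://www.nessus.org/plugins/index.php?view=single&amp;id=10395" target="_blank"> 10395</a>
</td>
<td width="70%" valign="top" class="classcell"><span class="classtext" style="color: #263645; font-weight: normal;">Microsoft Windows SMB Shares Enumeration</span></td>
</tr>
<tr><td>row to keep</td></tr>
</table>
"""

# subn returns the new text and the number of rows that were removed.
cleaned, removed = INFO_ROW.subn("", sample)
print(removed)
print(cleaned)
```

For the real file, the sample string would be replaced by the file's contents read from disk, and cleaned written back out. The pattern depends on the exact attribute order and spacing shown in the question, so any row that deviates from it is left untouched.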

revo