
I have an HTML file. Here is a sample

      <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
        <span title="A*" >Individual: <span><b>A*</b></span></span>
      </div>

    </td>

  </tr>

</table>

<table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">

  <tr class="ListItemColorNew">

    <td style="width:50%">
      <div class="gvListItemStyle">
        <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
        <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
        <div class="GrayTextShade">
          GREY TIDE LLC (LIC# 2222) 
        </div>
      </div>
    </td>

    <td style="width:50%">
      <div class="gvListItemStyle">
        <span class="LargeText15">FRANK WHITE A&#39;SMALLS </span> (LIC# 1111111)
        <div class="GrayTextShade"><i>Alternate Names: JAMES SMALLS</i></div>
        <div class="GrayTextShade">
          WEST RIVER CORP LLC (LIC# 3333) 
        </div>
      </div>
    </td>


    <td style="width: 25%; vertical-align: top">
      <div class="gvListItemStyle">
        <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
        </div>
    </td>

    <td style="width:25%;text-align:right;vertical-align:top">
      <div class="gvListItemStyle">
        <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
    </td>

  </tr>

I'm trying to extract everything between <td style="width:50%"> and </td>. The data is stored in a file testFile.txt.

This is the Perl code I used:

 system("perl -pi.bak -e '/^<td style=\"width:50%\">.+<\\/td>/mg' testFile.txt");
SonOfSeuss
  • And here's the obligatory "[don't parse HTML with regex](http://stackoverflow.com/a/1732454/2038227)" comment. – theftprevention Sep 07 '14 at 15:09
  • `^` matches the start of the line. Unless the element is right at the start of the line (i.e. no whitespace before it), you won't get any matches from your current regex. – i alarmed alien Sep 07 '14 at 15:09
  • Use an HTML parser such as [Mojo::DOM](https://metacpan.org/pod/release/SRI/Mojolicious-5.27/lib/Mojo/DOM.pm) or [Web::Scraper](https://metacpan.org/pod/Web::Scraper), e.g. `perl -Mojo -E 'say $_ for x(b("file.html")->slurp)->find(q{td[style="width:50%"]})'` – clt60 Sep 07 '14 at 15:20
  • You haven't tried very hard if your attempts are just a single line of Perl – Borodin Sep 07 '14 at 21:28
  • Just because you see one line of Perl doesn't mean that's the only thing I've tried. Your last comment is completely unhelpful. – SonOfSeuss Sep 08 '14 at 14:12

4 Answers


The code below doesn't actually do anything:

system("perl -pi.bak -e '/^<td style=\"width:50%\">.+<\\/td>/mg' testFile.txt");
  1. You're matching m// in a void context with no captures, so the executed statement is meaningless.

  2. Your pattern will never match your content because:

    a. You're using the any character ., but it won't match newlines unless you use the /s Modifier.

    b. You're using -p for line by line processing of the file, but your pattern would need to span lines in order to match.
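Both reasons in point 2 can be checked without touching a file. The following is only a minimal sketch, assuming a shortened stand-in for the sample HTML held in a string; the helper names `match_per_line` and `match_slurped` are made up for illustration.

```perl
use strict;
use warnings;

# Shortened stand-in for testFile.txt (assumption: same shape as the sample).
my $html = <<'HTML';
  <tr class="ListItemColorNew">
    <td style="width:50%">
      <div class="gvListItemStyle">JAMES BOND</div>
    </td>
  </tr>
HTML

# Mimics -p: the pattern is tried against one line at a time.
sub match_per_line {
    my ($text) = @_;
    my $count = grep { $_ =~ m{<td style="width:50%">.+</td>}s } split /^/m, $text;
    return $count;
}

# Whole text in one string; the /s modifier decides whether . may cross newlines.
sub match_slurped {
    my ($text, $use_s) = @_;
    return $use_s
        ? ( $text =~ m{<td style="width:50%">(.+?)</td>}s ? 1 : 0 )
        : ( $text =~ m{<td style="width:50%">(.+?)</td>}  ? 1 : 0 );
}

printf "per line: %d, slurped without /s: %d, slurped with /s: %d\n",
    match_per_line($html), match_slurped($html, 0), match_slurped($html, 1);
```

Only the last combination (whole text in one string, with /s) finds the cell. In a one-liner, the -0777 switch is the usual way to get the whole file into a single string.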

The following demonstrates both a regex solution (not recommended) and the use of an actual HTML parser, in this case Mojo::DOM. For a helpful 8-minute introductory video, check out Mojocast Episode 5.

use strict;
use warnings;

use Mojo::DOM;

my $data = do { local $/; <DATA> };

# Regex Solution:
if ( $data =~ m{<td style="width:50%">(.*?)</td>}s ) {
    print "Regex Solution:\n$1";
} else {
    warn "No pattern match found";
}

# Parser Solution:
my $dom = Mojo::DOM->new($data);

my $yourtd = $dom->at(q{td[style="width:50%"]})->content;

print "\nMojo::DOM:\n", $yourtd;

__DATA__
<html>
<head>
<title>Hello World</title>
</head>
<body>
<table>
    <tr>
        <td>
            <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
            <span title="A*" >Individual: <span><b>A*</b></span></span>
            </div>

        </td>
    </tr>
</table>

<table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">

    <tr class="ListItemColorNew">
        <td style="width:50%">
            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>
        </td>
        <td style="width: 25%; vertical-align: top">
            <div class="gvListItemStyle">
            <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
            </div>
        </td>
        <td style="width:25%;text-align:right;vertical-align:top">
            <div class="gvListItemStyle">
            <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
            </td>
    </tr>
</table>
</body>
</html>

Outputs:

Regex Solution:

            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>

Mojo::DOM:

            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>
Miller
  • This solution only seems to work for one div section. It fails at the second section. – SonOfSeuss Sep 11 '14 at 01:05
  • Your sample data and problem description only talked about one td with a width of 50%. If there is more to the problem it's up to you to include it in your question. Either way, the tools you will need are demonstrated above. Reading the docs and watching the linked video will go a long way toward teaching you how to adapt the solution yourself. – Miller Sep 11 '14 at 01:09
  • If you want to get multiple sections from the same input, this is the way to do it: `while( $data =~ m{<td style="width:50%">(.*?)</td>}gs) {print "\n$1";}`. – SonOfSeuss Sep 14 '14 at 23:23
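For the multi-cell case raised in these comments, a global match in list context returns every capture at once. This is a regex-only sketch, assuming the cells are not nested and the `style` attribute is written exactly as `width:50%`; `extract_cells` is a hypothetical helper name and the input is a shortened variant of the question's sample. With Mojo::DOM, `find` (rather than `at`) plays the same role and is the sturdier choice.

```perl
use strict;
use warnings;

# Returns the inner HTML of every <td style="width:50%"> cell in $html.
# Assumption: cells are not nested and the attribute text matches exactly.
sub extract_cells {
    my ($html) = @_;
    # /g in list context yields all captures; /s lets . match newlines.
    return $html =~ m{<td style="width:50%">(.*?)</td>}gs;
}

my $html = <<'HTML';
<tr>
  <td style="width:50%">
    <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span>
  </td>
  <td style="width:50%">
    <span class="LargeText15">FRANK WHITE A&#39;SMALLS </span>
  </td>
  <td style="width:25%">Details</td>
</tr>
HTML

my @cells = extract_cells($html);
print scalar(@cells), " cells found\n";
print "---\n$_\n" for @cells;
```

The third cell is skipped because its `style` value differs, which is also why an exact-text pattern like this is fragile against small changes in the markup.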
  .*?(<td style="width:50%">((?!<\/td>).)*?<\/td>)

See the demo. Use the gs flags.

http://regex101.com/r/oC3nN4/15

vks
  • Unmatched ( in regex; marked by <-- HERE in m/.*?((( <-- HERE ?! at -e line 1. – SonOfSeuss Sep 07 '14 at 15:52
  • @SonOfSeuss there's only one tag with width 50, and as you can see in the demo it does match. Are you sure you are using the correct flags? Don't use the m flag. – vks Sep 07 '14 at 16:06
  • I think your solution is great in the simulator and it works. I tested it with more data. I think the implementation in Perl is the problem. Here's your solution: `system("perl -pi.bak -e '/.*?(((?!<\/td>).)*?<\/td>)/gs' testFile.txt");` Here's the error: `Unmatched ( in regex; marked by <-- HERE in m/.*?((( <-- HERE ?! at -e line 1.` I fixed this by escaping the double quotes but it still has no results in the textFile.txt. – SonOfSeuss Sep 07 '14 at 16:17
  • @SonOfSeuss ... the regex is a perl regex but i got no knowledge of perl :( ....... – vks Sep 07 '14 at 17:30

As said in the comments, remove the ^ in your regexp.

Also, use /s instead of /mg so that the '.' pattern also matches newline characters ('\n'). The pattern can then span lines, provided the whole file content is matched as a single string.

/<td style=\"width:50%\">.+?<\\/td>/s

.+? will stop matching at the first occurrence of </td>, not the last.
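The difference between `.+` and `.+?` shows up as soon as the input holds more than one cell. A small sketch with made-up input (not taken from testFile.txt); `greedy_cell` and `lazy_cell` are illustrative names only.

```perl
use strict;
use warnings;

# Two cells on separate lines; the content is illustrative only.
my $html = <<'HTML';
<td style="width:50%">
FIRST
</td>
<td style="width:25%">
SECOND
</td>
HTML

# Greedy: .+ runs to the LAST </td> in the string.
sub greedy_cell {
    my ($text) = @_;
    return $text =~ m{<td style="width:50%">(.+)</td>}s ? $1 : undef;
}

# Non-greedy: .+? stops at the FIRST </td> after the opening tag.
sub lazy_cell {
    my ($text) = @_;
    return $text =~ m{<td style="width:50%">(.+?)</td>}s ? $1 : undef;
}

for my $pair ( [ 'greedy .+ ', greedy_cell($html) ], [ 'lazy   .+?', lazy_cell($html) ] ) {
    my ($label, $got) = @$pair;
    printf "%s %s the second cell\n", $label,
        $got =~ /SECOND/ ? 'runs into' : 'stops before';
}
```

Both versions need the input as one string and the /s modifier; only the quantifier differs.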

Fred Sullet
  • Not working. No bugs but simply doesn't work. The output is a file the exact same as `testFile.txt`. Thanks though. – SonOfSeuss Sep 07 '14 at 15:59
  • `perl -e 'open(FILE, "testFile.txt"); @data=<FILE>; $string=join("",@data); print $1 if $string=~/(<td style="width:50%">.+?<\/td>)/s;'` – Fred Sullet Sep 07 '14 at 18:15

I hope you've seen previous advice to avoid regexes to process HTML? It's really true! The only excuse I can think of for avoiding one of the several excellent HTML modules is that your data is so malformed that nothing else will process it.

Your "sample" of your HTML file is particularly unhelpful. Before I fixed the indentation the lines were scattered all over the place. After I looked at it I saw that it was the end of one table element followed by the start of another, so it left several elements unbalanced and either closed but not opened or vice-versa. Please don't do that to us.

I built a well-formed HTML file that contains your extract, and here is a program that processes it using HTML::TreeBuilder:

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file('html.html');
my @td50 = $tree->look_down(_tag => 'td', style => 'width:50%');
print $_->as_HTML('<>&', '  '), "\n\n" for @td50;

output

<td style="width:50%">
  <div class="gvListItemStyle"><span class="LargeText15">JAMES BOND A'MONEYPENNY </span> (LIC# 1111111) <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
    <div class="GrayTextShade"> GREY TIDE LLC (LIC# 2222) </div>
  </div>
</td>

In case you or others need it, here's the HTML input document that I used

<html>
  <body>

    <table>
      <tr>
        <td>
          <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
            <span title="A*" >Individual: <span><b>A*</b></span></span>
          </div>
        </td>
      </tr>
    </table>

    <table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">
      <tr class="ListItemColorNew">

        <td style="width:50%">
          <div class="gvListItemStyle">
            <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
            <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
            <div class="GrayTextShade">
              GREY TIDE LLC (LIC# 2222) 
            </div>
          </div>
        </td>

        <td style="width: 25%; vertical-align: top">
          <div class="gvListItemStyle">
            <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
            </div>
        </td>

        <td style="width:25%;text-align:right;vertical-align:top">
          <div class="gvListItemStyle">
            <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
        </td>

      </tr>
    </table>
  </body>
</html>
Borodin