How to get strings in between two strings

Question

I have a group of HTML files from which I have to extract the content between the <hr> and </hr> tags. I have done everything except this extraction. What I have done so far:

1. Loaded all HTML files and stored them in @html_files.

2. Stored each file's content in the @useful_files array.

3. Looped over the @useful_files array, checking each line for <hr>. If it is found, I need the following lines of content in the @elements array.

Is this possible? Am I on the right track?

foreach(@html_files){
    $single_file = $_;
    $elemets = ();
    open $fh, '<', $dir.'/'.$single_file or die "Could not open '$single_file' $!\n";
    @useful_files = ();
    @useful_files = <$fh>;
    foreach(@useful_files){
        $line = $_;
        chomp($line);
        if($line =~ /<hr>/){
            @elements = $line;
        }
    }
    create(@elements,$single_file)
}

Thanks !!!

My input HTML file looks like this:

<HR  SIZE="3" style="COLOR:#999999" WIDTH="100%" ALIGN="CENTER">
<P STYLE="margin-top:0px;margin-bottom:0px; text-indent:4%"><FONT STYLE="font-family:Times New Roman" SIZE="2">Lorem ipsum dolor sit amet, consectetur adipiscing elit.  </FONT></P> 
<P STYLE="font-size:12px;margin-top:0px;margin-bottom:0px">&nbsp;</P>
<TABLE CELLSPACING="0" CELLPADDING="0" WIDTH="100%" BORDER="0"  STYLE="BORDER-COLLAPSE:COLLAPSE">
<TR>
<TD WIDTH="45%"></TD>
<TD VALIGN="bottom" WIDTH="1%"></TD>
<TD WIDTH="4%"></TD>
<TD VALIGN="bottom"></TD>
<TD WIDTH="4%"></TD>
<TD VALIGN="bottom" WIDTH="1%"></TD>
<TD WIDTH="44%"></TD></TR>
<TR>
<TD VALIGN="top"></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;</FONT></TD>
<TD VALIGN="bottom"></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;</FONT></TD>
<TD VALIGN="bottom"><FONT STYLE="font-family:Times New Roman" SIZE="2">Title:</FONT></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;</FONT></TD>
<TD VALIGN="bottom"><FONT STYLE="font-family:Times New Roman" SIZE="2">John</FONT></TD></TR>
</TABLE>

<p Style='page-break-before:always'>
<HR  SIZE="3" style="COLOR:#999999" WIDTH="100%" ALIGN="CENTER">

The HTML code I have copied here is just a sample. I need the exact content between the <hr> tags in the @elements array.

an example with expected output would be better. Did you want a grep solution? — Avinash Raj, Jan 29 '15 at 12:01
I want to create a new html with the content between `<hr>` and `</hr>` from the existing html. — user3431651, Jan 29 '15 at 12:06
Then perhaps better use `sed -ri 's#<hr>(.*)</hr>#<hr>newcontents</hr>#g' file.html`. It replaces the contents between each hr tag with *newcontents*. Or do you "need" a Perl variant for that? — Marc Bredt, Jan 29 '15 at 12:12
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — choroba, Jan 29 '15 at 12:17
If you can give us example source and example output, this will get much easier to answer. — Sobrique, Jan 29 '15 at 12:46
Yes. Edit your post, and put it in 'code' quotes - e.g new line, then indent by 4 spaces. — Sobrique, Jan 29 '15 at 13:06
Whatever you got there, these aren't proper HTML files. [`<hr>` is defined to be empty](http://www.w3.org/TR/2014/REC-html5-20141028/grouping-content.html#the-hr-element). Since you have something strange going on there, if you throw that into different browsers, you'll probably get different results. — Patrick J. S., Jan 29 '15 at 14:40
The format of the HTML code you posted is inconsistent with the description of the problem you are trying to solve. There is no `</hr>` so searching and trying to match text between `<hr>` and `</hr>` will not get you any results. — tjwrona1992, Jan 30 '15 at 20:37

score 1 · Answer 1 · answered Jan 29 '15 at 14:47

In the simplest way, you may do this:

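my @cont;
foreach my $file (@ARGV) {
  open my $fh, '<', $file or die "Could not open '$file': $!\n";
  # Join the chomped lines into one string, then collect every <hr>...</hr> capture.
  push @cont, join('', map { chomp; $_ } <$fh>) =~ m%<hr>(.*?)</hr>%g;
}
#print join("\n",@cont,'');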

And yes, don't worry: all files will be closed on exit "automagically" :)

Hint: uncomment print statement to see the result.
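A variant of the same idea, as a rough sketch only: instead of chomping and joining the lines, read the whole file in slurp mode and let the /s modifier make . match newlines, so the line breaks inside each block are kept. The sample string and the in-memory filehandle are assumptions made so the snippet runs on its own; in real use you would open each file from @ARGV instead.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the contents of one HTML file.
my $html = "<p>intro</p>\n<hr>first\nblock</hr>\n<p>gap</p>\n<hr>second</hr>\n";

# Open a read-only filehandle on the in-memory string (assumption for the demo).
open my $fh, '<', \$html or die "Could not open in-memory file: $!";

# Slurp mode: with $/ undefined, <$fh> returns the whole file as one string.
my $content = do { local $/; <$fh> };
close $fh;

# /s lets . match newlines; /g in list context returns every capture.
my @cont = $content =~ m{<hr>(.*?)</hr>}gs;

print "[$_]\n" for @cont;
```

Note that the match is still case-sensitive and expects a bare <hr> tag; add /i and allow for attributes if your files need that.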

score 1 · Answer 2 · answered Jan 29 '15 at 21:29

You can use grep in the command line:

grep -Pzo '<hr>\K((.|\n)*)(?=</hr>)' file.html

This will allow you to extract anything between <hr> and </hr> even if new lines are present.

Example:

tiago@dell:/tmp$ grep -Pzo '<hr>\K((.|\n)*)(?=</hr>)' <<< '<hr>a b c d </hr>'
a b c d 
tiago@dell:/tmp$ grep -Pzo '<hr>\K((.|\n)*)(?=</hr>)' <<< $'<hr>a b\nc d </hr>'
a b
c d

And of course you can run grep against multiple files.

tjwrona1992 · Accepted Answer · 2015-01-29T14:38:49.183

0

I know people say not to parse HTML with a regex, but this seems like the kind of relatively simple task that warrants the use of a regex.

Try this:

if ($line =~ m/<hr>(.*?)<\/hr>/){
    push @elements, $1; 
}

This will extract the text between <hr> and </hr> and store it in the next index in the @elements array.
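The regex above only matches when <hr> and </hr> sit on the same line. If your files look like the sample in the question (an `<HR ...>` tag with attributes on its own line, the content on the following lines, then another `<HR ...>`), something like the sketch below may be closer to what you need. It is only a sketch: the @lines data is made up, and it assumes you want everything between the first and second hr tag.

```perl
use strict;
use warnings;

# Hypothetical lines standing in for the contents of one file.
my @lines = (
    '<p>before</p>',
    '<HR SIZE="3" WIDTH="100%">',
    '<P>Lorem ipsum</P>',
    '<TABLE></TABLE>',
    '<HR SIZE="3" WIDTH="100%">',
    '<p>after</p>',
);

my @elements;
my $inside = 0;    # true while we are between the two hr tags

for my $line (@lines) {
    # Match an opening hr tag in any case, with or without attributes.
    if ($line =~ /<hr\b[^>]*>/i) {
        last if $inside;    # second hr tag: stop collecting
        $inside = 1;        # first hr tag: start collecting
        next;
    }
    push @elements, $line if $inside;
}

print "$_\n" for @elements;
```

The flag approach avoids needing a closing tag at all, which matters here because the sample input never contains one.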

Also you should ALWAYS use strict; and use warnings; at the top of your code! This will stop you from making dumb mistakes and prevent many needless headaches down the road.

You should also close your file after you are done extracting its contents into the @useful_files array! close $fh;

(On a side note, the name of this array is misleading. I would suggest you name it something like @lines or @file_contents since it contains the contents of a single file... not multiple files as your variable name seems to suggest.)
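Putting the advice above together, a minimal sketch could look like this. The helper name read_lines and the temporary sample file are assumptions for illustration only; the point is strict and warnings at the top, a lexical filehandle declared with my, and an explicit close as soon as the contents have been read.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway sample file so the sketch is self-contained.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
print {$tmp_fh} "one <hr>alpha</hr>\ntwo\nthree <hr>beta</hr>\n";
close $tmp_fh or die "Could not close '$filename': $!";

# read_lines is a hypothetical helper: it returns the chomped lines of a file.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Could not open '$path': $!";
    chomp(my @lines = <$fh>);
    close $fh or die "Could not close '$path': $!";    # close right after reading
    return @lines;
}

my @lines = read_lines($filename);

my @elements;
for my $line (@lines) {
    # Same single-line match as above: capture the text between the tags.
    push @elements, $1 if $line =~ m/<hr>(.*?)<\/hr>/;
}

print "$_\n" for @elements;
```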

edited Jan 29 '15 at 14:38

answered Jan 29 '15 at 14:16

tjwrona1992


If you use a lexical filehandle (with `open my $fh, '<', $filename or die`) the filehandle is autoclosed when it goes out of scope. – Patrick J. S. Jan 29 '15 at 14:32
In order for it to be lexical there has to be a `my` there, which in his case there is not. His file handle is global to the whole script. It is also bad practice even with the use of a lexical file handle to wait until it goes out of scope to close it. A file handle should ALWAYS be closed the second it is done being used. – tjwrona1992 Jan 29 '15 at 14:34
A coworker of mine actually ran into an issue this morning with a file handle failing to automatically close. He wrote a function and accidentally had it return before closing the file and Perl did not auto-close the file handle for him. He spent all morning tracking down the issue. It is always better to explicitly state things such as closing a file instead of expecting Perl to clean up for you behind the scenes. Things don't always work the way you expect them to work and if you rely too heavily on Perl's internal "magic" that you can't see, you may run into some nasty unexpected surprises. – tjwrona1992 Feb 04 '15 at 17:43
Whatever he did, he didn't use a lexical filehandle that ran out of scope. It could be that he had circular references, or some other nasty things, but if an `$fh`'s `DESTROY` gets called, it will be closed. I just checked a few edge cases, but came up with none where the handle had a reference count of `0` and wasn't closed (checked with `lsof`). What I didn't do is manipulate the internals. So if you run into cases where this matters, you had better know what you're doing in the first place. – Patrick J. S. Feb 04 '15 at 20:57
Now that I think about it I don't think he did use a lexical file handle which is not good practice either. In @user3431651's case the file handle is not lexical either (he never used `strict` and declared it globally). The file would be open until the end of his script's execution. But even if he was using a lexical file handle, the scope of a lexical file handle could be large so even in the case of a lexical file handle it may be open for quite some time. It is still the best practice to close it manually immediately after you are done using it regardless of what Perl will do with it later. – tjwrona1992 Feb 05 '15 at 13:33
