
Been struggling with this for a couple of hours now...

I have the following regex:

(?<=\bdata-video-id=""."">)(.*?)(title=.*?>)

The following input:

         <div class="cameras">            
            <table class="results">
                <colgroup>
                    <col class="col0">
                    <col class="col1">
                </colgroup>
                <thead>
                    <tr>
                        <th title="Name">
                            Name
                        </th>
                        <th title="Date">
                            Date
                        </th>
                    </tr>
                </thead>
                <tbody>
                    <tr data-video-id="1">
                        <td title="149 - Cam123">
                            149 - Cam123
                        </td>
                        <td title="Feb 18 2013">
                            Feb 18 2013
                        </td>
                    </tr>
                    <tr data-video-id="2">
                        <td title="150 - Cam456">
                            150 - Cam456
                        </td>
                        <td title="Feb 18 2013">
                            Feb 18 2013
                        </td>
                    </tr>                   
                </tbody>
            </table>
        </div>

The regex outputs this:

<td title="149 - Cam123">
<td title="150 - Cam456">

But what I'd like to get is the contents of the title attribute of the 1st cell from every table row:

149 - Cam123
150 - Cam456

The number of rows may obviously vary, but the number of columns is fixed. Please help me fine-tune the above regex. Thanks.

NOTE: The solution MUST be a regular expression. I do not have access to the code base, so an HTML parser or any other kind of code change is not possible. The only way I can hook into the application is by injecting a different regex.

Tsef
  • in what language? also where is input? – Kent Feb 20 '13 at 15:57
  • Why a Regex? Use a html parser. – Rich O'Kelly Feb 20 '13 at 15:57
  • **[obligatory ͠P̯͍̭O̚​N̐Y̡ link](http://stackoverflow.com/a/1732454/664108)** – Fabian Schmengler Feb 20 '13 at 16:01
  • I found a rule: any regex + XML/HTML question gets commented on or answered with "why not a parser". Yes, if you are parsing a whole XML/HTML document, regex won't be the right choice. However, in many cases our program reads only a **part** of the text, which is a few HTML/XML elements in a known format. In this case, regex does work. It also works for very simple XML/HTML structures with a fixed format. Otherwise we have to import a new library and write a dozen lines of code just to get an attribute. Well, my 2 cents... – Kent Feb 20 '13 at 16:07
  • Please post the input. – rrrr-o Feb 20 '13 at 16:28
  • input posted along with a short explanation – Tsef Feb 20 '13 at 19:24

1 Answer


Given the OP's requirement that it MUST be a regex, my suggestion is to wrap the title value in a capture group:

(?<=\bdata-video-id=""."">).*?title="(.*?)">
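To sanity-check the pattern outside the application, here is a minimal Python sketch. It rests on several assumptions: Python's `re` module stands in for the application's unknown regex engine, each quote is written once (assuming the doubled `""` in the question is string-level escaping rather than part of the pattern), and `re.DOTALL` is switched on so that `.` can cross the line break between `<tr ...>` and `<td ...>`. The names `HTML`, `PATTERN` and `first_cell_titles` are made up for illustration.

```python
import re
from typing import List

# Trimmed copy of the table body from the question.
HTML = """
<tbody>
    <tr data-video-id="1">
        <td title="149 - Cam123">
            149 - Cam123
        </td>
        <td title="Feb 18 2013">
            Feb 18 2013
        </td>
    </tr>
    <tr data-video-id="2">
        <td title="150 - Cam456">
            150 - Cam456
        </td>
        <td title="Feb 18 2013">
            Feb 18 2013
        </td>
    </tr>
</tbody>
"""

# The suggested pattern with single (unescaped) quotes.
# re.DOTALL is an assumption: it lets `.` match the newline after <tr ...>.
PATTERN = re.compile(r'(?<=\bdata-video-id=".">).*?title="(.*?)">', re.DOTALL)


def first_cell_titles(html: str) -> List[str]:
    """Return capture group 1 (the title text) for each row that matches."""
    return [match.group(1) for match in PATTERN.finditer(html)]


if __name__ == "__main__":
    for title in first_cell_titles(HTML):
        print(title)
```

Two caveats. The title text is only in group 1; the whole match still includes the whitespace and the `<td title=...>` tag, so this helps only if the application can be told to use a capture group. Also, the single `.` in the lookbehind matches exactly one character, so a row with an id of two or more digits would not match as written.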

Otherwise, the general solution is to not use a regex:

Why are you using a regex? Given the complexity of HTML tags, the typical solution is to use an HTML parser.

Here is an SO question about this topic.

Here is another, even more popular, response on using regex for XHTML, which Jeff Atwood pointed out in this blog post.

Justin Pihony
  • Yep, I know. However this is something running on customer premise and I cannot change the code base. It has some kind of regular expression engine which I can inject regexes into without touching the code base. – Tsef Feb 20 '13 at 16:03
  • I have updated my answer in that case. Please be careful with this per the links I provided as regex is not suggested here – Justin Pihony Feb 20 '13 at 16:07
  • hmmm...that didn't do the trick. can you please post the full regex, I'm not too much of a regex guy... – Tsef Feb 20 '13 at 16:18
  • @zaf As rrrr commented, could you post the input. I think I was going off of the wrong input – Justin Pihony Feb 20 '13 at 17:01
  • input posted along with a short explanation – Tsef Feb 20 '13 at 19:25