I have not used Perl regex before and am struggling to extract only the first occurrence from some HTML. The HTML concerned is as follows:

<tr><th class="a-span5 a-size-base">Model number</th><td class="a-span7 a-size-base">MK6174</td></tr><tr><th class="a-span5 a-size-base">Part Number</th><td class="a-span7 a-size-base">MK6174</td></tr>

I am trying to extract only the first match of MK6174.

The current regex I have come up with is

([A-Z0-9_.\/-]{6,})

(I have many other model numbers, all 6 or more characters long and alphanumeric; a few contain the special characters above.)

From my research I understand I need to somehow use .*? or .+? to make it non-greedy, but I cannot find the correct place to put it.

I find that if I put it at the end it makes no difference, and anywhere in between it yields zero results.

I have also tried adding /g to the end, and that yields zero results as well.

I also need this to work as a single regex: the program I am using does not work when I enter one regex to isolate the model number HTML (up to the closing td tag) and then the regex above to pull MK6174 out.

What could I do to fix this?

Edit: I forgot to mention that with my current regex, my output result is 'MK6174MK6174'

GMB
Jeremy
  • Your regex looks fine to capture the given string from your sample data. Please show us how you use the regex. – GMB Sep 15 '19 at 13:04
  • Hi, I am using diggernaut with the same format as they have, putting my regex in the filter: ... bit - https://www.diggernaut.com/dev/meta-language-methods-working-with-register-parse.html – Jeremy Sep 15 '19 at 13:12
  • Mmm, so how is this related to the `perl` development language, that you tagged your question with? – GMB Sep 15 '19 at 13:16
  • Because they use perl? - if I force an error message it says "error parsing regexp: invalid or unsupported Perl syntax: `..." – Jeremy Sep 15 '19 at 13:19
  • Your regex looks fine to me. In plain perl, you could use an expression like `my ($match) = ($str =~ /([A-Z0-9_.\/-]{6,})/);` to capture the first match (given your input string as `$str`, this would return `MK6174`). I don't know how diggernaut does it though. Since there is no diggernaut tag here on SO, I have added the name to the title of your question. Hopefully someone who knows about diggernaut will pick this up. – GMB Sep 15 '19 at 13:38
  • [Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239). HTML and regex are not good friends. Use a parser; it is simpler, faster and much more maintainable. – Toto Sep 15 '19 at 15:05
  • After a bit of digging (pun intended) I am pretty sure Diggernaut is written in Perl. There is a chance the person running the business might come across this here. However, your best bet, given you are probably paying for the service, is to ask within their community or to raise a ticket through their website. Since this is a commercial, closed-source product, we cannot look into its code to see how it works, so unfortunately we can only guess. – simbabque Sep 16 '19 at 10:59
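
The plain-Perl expression from the comments can be tried as a small standalone script. This is only a sketch of how Perl itself behaves; how Diggernaut applies the pattern is not known here. The second part, which uses `/g`, is a guess at why the reported output was 'MK6174MK6174', not a confirmed explanation.

```perl
use strict;
use warnings;

# Sample row taken from the question.
my $str = '<tr><th class="a-span5 a-size-base">Model number</th><td class="a-span7 a-size-base">MK6174</td></tr><tr><th class="a-span5 a-size-base">Part Number</th><td class="a-span7 a-size-base">MK6174</td></tr>';

# Without /g, a match in list context returns the capture groups of the
# first (leftmost) match only.
my ($match) = ($str =~ /([A-Z0-9_.\/-]{6,})/);
print "$match\n";              # prints MK6174

# With /g in list context, every match is returned. Joining them gives
# the doubled value (assumption: the tool may do something similar).
my @all = ($str =~ /([A-Z0-9_.\/-]{6,})/g);
print join('', @all), "\n";    # prints MK6174MK6174
```

Note that no non-greedy quantifier is involved: the pattern already stops at the first character outside the class, so the doubling can only come from the pattern being applied more than once.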

1 Answer

Using CSS to select the td adjacent to the Model number th:

use strict;
use warnings;
use Web::Query::LibXML 'wq';
# Sample row from the question.
my $html = '<tr><th class="a-span5 a-size-base">Model number</th><td class="a-span7 a-size-base">MK6174</td></tr><tr><th class="a-span5 a-size-base">Part Number</th><td class="a-span7 a-size-base">MK6174</td></tr>';
# Select the td immediately following the th that contains "Model number".
print wq($html)->find('th:contains("Model number") + td')->text;
__END__
MK6174
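
If only a single regex can be supplied, the same idea (anchoring on the "Model number" header) might be sketched as below. This is an assumption-laden sketch, not a tested Diggernaut filter: the variable names are made up for illustration, and it assumes the tool returns the first capture group and that the markup keeps the `th` directly before the `td`. It avoids lookarounds, which some engines reject.

```perl
use strict;
use warnings;

# Sample row taken from the question.
my $html = '<tr><th class="a-span5 a-size-base">Model number</th><td class="a-span7 a-size-base">MK6174</td></tr><tr><th class="a-span5 a-size-base">Part Number</th><td class="a-span7 a-size-base">MK6174</td></tr>';

# Match the literal header text, its closing tag, the opening td tag
# (whatever its attributes), then capture the model number itself.
my $model_re = qr{Model number</th>\s*<td[^>]*>([A-Z0-9_./-]{6,})};

my ($model) = ($html =~ $model_re);
print "$model\n";    # prints MK6174
```

Because the literal "Model number" text appears only once in the row, the pattern cannot also match the "Part Number" cell, even if it is applied globally.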
daxim