I need some help correcting my regex. I have a string of text (a large body of HTML), and I need to pattern match it so that the data nested within `<div>` tags can be extracted and used.

Let's take a test case, `<div id=1>`:

<div id=1>UID:1currentPartNumber:63222TRES003H1workcenter:VLCSKDcycleTime:98.8curPartCycleTime:63.66partsMade:233curCycleTimeActual:62.4target:291actual:233downtime:97statusReason:lineStatus:Productionefficiency:80.05plusminus:-260curProdTime:7/16/2019 12:28:01 PM</div>

Note that `lineStatus` can either have a value or be empty, and the same goes for `statusReason`.

I was able to come up with a regex that does most of the work, but I am struggling with cases where values are not present.

Here is my attempt:

(
(<div id=(\d|\d\d)>)
(UID:(\d|\d\d))
(currentPartNumber:(.{1,20}))
(workcenter:(.{1,20}))
(cycleTime:(.{1,6}))
(curPartCycleTime:(.{1,6}))
(partsMade:(.{1,6}))
(CycleTimeActual:(.{1,6}))
(target:(.{1,6}))
(actual:(.{1,6}))
(downtime:(.{1,6}))
((statusReason:((?:.)|(.{1,6}))))
((lineStatus:((?:.)|(.{1,6}))))
(Productionefficiency:(.{1,6}))
(plusminus:(.{1,6}))
(curProdTime:(.{1,30}))
)

I split it across lines just for readability.

Thanks,

HRD1997BFBE
  • One big issue I think is that `(currentPartNumber:(.{1,20}))` captures too much, since `workcenter` appears before the 20th position. This is probably going to be true for other matches as well. – MonkeyZeus Jul 16 '19 at 19:24
  • Thanks for your input. To add to that, though: doesn't the capture group stop when the next one begins? Essentially, the only reason I used `{1,20}` is that the next capture group ends the match as soon as it matches, no? – HRD1997BFBE Jul 16 '19 at 19:32
  • Hmm, you are correct. I forgot about that convenient feature. – MonkeyZeus Jul 16 '19 at 19:43
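
The backtracking discussed in these comments can be checked directly. Here is a small Python sketch on a fragment of the sample record; the variable names are only illustrative. On its own, `.{1,20}` grabs the full 20 characters, but once the next label follows it in the pattern, the engine backs off until that label matches:

```python
import re

# Fragment of the sample record from the question.
text = "currentPartNumber:63222TRES003H1workcenter:VLCSKDcycleTime:98.8"

# Nothing follows the group, so the greedy quantifier keeps all 20 characters.
alone = re.search(r"currentPartNumber:(.{1,20})", text)

# The literal "workcenter:" must match next, so the group backtracks to fit.
with_next_label = re.search(r"currentPartNumber:(.{1,20})workcenter:", text)

print(alone.group(1))            # 63222TRES003H1workce
print(with_next_label.group(1))  # 63222TRES003H1
```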

2 Answers

Try Regex: ((<div id=(\d|\d\d)>)(UID:(\d|\d\d))(currentPartNumber:(.{1,20}))(workcenter:(.{1,20}))(cycleTime:(.{1,6}))(curPartCycleTime:(.{1,6}))(partsMade:(.{1,6}))(curCycleTimeActual:(.{1,6}))(target:(.{1,6}))(actual:(.{1,6}))(downtime:(.{1,6}))(statusReason:(.{1,6})?)(lineStatus:(.{1,6})?)(Productionefficiency:(.{1,6}))(plusminus:(.{1,6}))(curProdTime:(.{1,30})))

Warning: you can't parse HTML with regex.
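
If that warning is a concern, one middle ground is to let an HTML parser find the `<div>` elements and apply the regex only to the text inside each one. Below is a minimal Python sketch using the standard library's `html.parser`. The `DivTextCollector` class is a hypothetical helper, the first record is shortened from the sample, the second record is made-up test data, and the sketch assumes the data `<div>`s are not nested:

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text of every <div> that carries an id attribute."""

    def __init__(self):
        super().__init__()
        self.records = {}        # div id -> text found inside that div
        self._current_id = None  # id of the div currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            div_id = dict(attrs).get("id")
            if div_id is not None:
                self._current_id = div_id
                self.records[div_id] = ""

    def handle_data(self, data):
        if self._current_id is not None:
            self.records[self._current_id] += data

    def handle_endtag(self, tag):
        if tag == "div":
            self._current_id = None


# First record is shortened from the question's sample; the second is invented.
html = (
    "<html><body>"
    "<div id=1>UID:1currentPartNumber:63222TRES003H1workcenter:VLCSKD</div>"
    "<div id=2>UID:2currentPartNumber:ABCworkcenter:XYZ</div>"
    "</body></html>"
)

collector = DivTextCollector()
collector.feed(html)
collector.close()

for div_id, text in collector.records.items():
    print(div_id, text)
```

Each value in `collector.records` is then plain text, so the field regex no longer needs to match `<div id=...>` or `</div>` itself.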

Matt.G
  • The regex string works; the opinion piece doesn't. Quoting directly from the post you linked (skipping the top answer, which is useless to me): "While it is true that asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML. (...) Regexes worked just fine for me, and were very fast to set up." Thanks for the fix though, Matt. – HRD1997BFBE Jul 16 '19 at 19:41

You are very, very close.

If you use:

(
(<div id=\d{1,2}>)
(UID:\d{1,2})
(currentPartNumber:(.{1,20}))
(workcenter:(.{1,20}))
(cycleTime:(.{1,6}))
(curPartCycleTime:(.{1,6}))
(partsMade:(.{1,6}))
(curCycleTimeActual:(.{1,6}))
(target:(.{1,6}))
(actual:(.{1,6}))
(downtime:(.{1,6}))
(statusReason:(.{0,6}))
(lineStatus:(.{0,6}))
(Productionefficiency:(.{1,6}))
(plusminus:(.{1,6}))
(curProdTime:(.{1,30}))
(<\/div>)
)

Then $3\n$4\n$6\n$8\n$10\n$12\n$14\n$16\n$18\n$20\n$22\n$24\n$26\n$28\n$30 will be:

UID:1
currentPartNumber:63222TRES003H1
workcenter:VLCSKD
cycleTime:98.8
curPartCycleTime:63.66
partsMade:233
curCycleTimeActual:62.4
target:291
actual:233
downtime:97
statusReason:
lineStatus:
Productionefficiency:80.05
plusminus:-260
curProdTime:7/16/2019 12:28:01 PM

By using (statusReason:(.{0,6}))(lineStatus:(.{0,6})) you make the value of statusReason and lineStatus truly optional.
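
The difference between `(.{0,6})` and wrapping the value group in `?`, as in `(.{1,6})?`, shows up in what the group holds when the value is missing. A small Python sketch of that difference; the `JAM` value is made up for illustration:

```python
import re

empty = "downtime:97statusReason:lineStatus:Production"
filled = "downtime:97statusReason:JAMlineStatus:Production"  # "JAM" is invented

optional_group = re.compile(r"statusReason:(.{1,6})?lineStatus:")
zero_length_ok = re.compile(r"statusReason:(.{0,6})lineStatus:")

# Both patterns match both strings; only the captured value differs.
print(repr(optional_group.search(empty).group(1)))   # None
print(repr(zero_length_ok.search(empty).group(1)))   # ''
print(repr(optional_group.search(filled).group(1)))  # 'JAM'
print(repr(zero_length_ok.search(filled).group(1)))  # 'JAM'
```

With `(.{0,6})` the group always takes part in the match, so code that reads it gets an empty string and does not have to handle a missing group.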

I also simplified the start <div> and UID detection.
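
Another way to make every value optional, sketched here in Python under a few assumptions, is to build the pattern from the list of labels and let each value be a lazy `.*?` that stops at the next label. The `LABELS` list and the `parse_records` helper are hypothetical names. The sketch also assumes that `Productionefficiency` in the sample is really the `lineStatus` value `Production` followed by an `efficiency` label, which the question does not confirm:

```python
import re

# Labels in the order they appear in the sample record. "efficiency" is an
# assumption: "lineStatus:Productionefficiency:80.05" is read here as
# lineStatus = "Production" followed by the label "efficiency".
LABELS = [
    "UID", "currentPartNumber", "workcenter", "cycleTime", "curPartCycleTime",
    "partsMade", "curCycleTimeActual", "target", "actual", "downtime",
    "statusReason", "lineStatus", "efficiency", "plusminus", "curProdTime",
]

# Builds "UID:(?P<UID>.*?)currentPartNumber:(?P<currentPartNumber>.*?)...".
# Each lazy value ends where the next label (or </div>) begins and may be empty.
body = "".join(f"{re.escape(label)}:(?P<{label}>.*?)" for label in LABELS)
RECORD = re.compile(rf"<div id=(?P<id>\d+)>{body}</div>")


def parse_records(html):
    """Return one dict of field values per matching <div>."""
    return [match.groupdict() for match in RECORD.finditer(html)]


sample = (
    "<div id=1>UID:1currentPartNumber:63222TRES003H1workcenter:VLCSKD"
    "cycleTime:98.8curPartCycleTime:63.66partsMade:233"
    "curCycleTimeActual:62.4target:291actual:233downtime:97"
    "statusReason:lineStatus:Productionefficiency:80.05plusminus:-260"
    "curProdTime:7/16/2019 12:28:01 PM</div>"
)

records = parse_records(sample)
for name, value in records[0].items():
    print(f"{name}: {value!r}")
```

Since `.` does not match a newline by default, this expects each record on a single line, as in the sample. If the real label turns out to be different, only the `LABELS` list needs to change.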

MonkeyZeus