Requirement: I have the following data to match with a regex. I need to get Name 1, Name 2, Name 3 and Name 4.

Some conditions:

  1. The regex must take into account that the names always come after <H2>Composition</H2>.
  2. There can be any number of names, i.e. after Composition there may be only one pattern (Name 1), two patterns (Name 1 and Name 2), and so on.
  3. At least one name is always present after Composition, so the regex can assume: "if Composition is present, then Name 1 is certainly there".

Example:

 <H2>Composition</H2>
 <A href="/generics/levocetrizine-210129">Name 1</A>,
 <A href="/generics/paracetamol-210459">Name 2(500 mg)</A>,
 <A href="/generics/phenylephrine-hydrochloride-210494">Name 3</A>,
 <A href="/generics/ambroxol-hydrochloride-211798">Name 4</A></DIV></DIV></DIV></DIV>

So far, I have only been able to get the first name, i.e. Name 1, with the following script. It simply ignores the rest of the names: in the case above, Name 2, Name 3 and Name 4 are missing from my output.

[regex]$regex = 
@'
(?s).+?<H2>Composition</H2>.*?href="/generics/.*?">(.*?)</A>
'@
jessehouwing
Powershel

1 Answer

This problem is much easier to solve with an XPath expression or a little bit of C# against the HTML Agility Pack. Regular expressions are going to be a major pain, though in this case you might be able to make them work.

With the HTML Agility Pack it would be something like:

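using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourString);

// HTML Agility Pack exposes element names in lowercase, so the XPath uses h2 and a.
string xpath = "//h2[contains(text(), 'Composition')]/following-sibling::a[contains(@href, '/generics/')]";

// SelectNodes returns null when nothing matches.
var nodes = doc.DocumentNode.SelectNodes(xpath);
if (nodes != null)
{
    foreach (var node in nodes)
    {
        string name = node.InnerText;
        string uri = node.Attributes["href"].Value;
    }
}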

Converting this little C# snippet to PowerShell should not be hard.
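
For illustration, a PowerShell version might look roughly like the sketch below. It assumes that HtmlAgilityPack.dll has already been downloaded (for example from NuGet); the DLL path is a placeholder, and `$content` simply holds the sample HTML from the question. The XPath uses lowercase element names because the HTML Agility Pack exposes tag names in lowercase.

```powershell
# Assumption: the DLL path is a placeholder; point it at your own copy.
Add-Type -Path 'C:\path\to\HtmlAgilityPack.dll'

# Sample input from the question; in practice this would be the page HTML.
$content = @'
<H2>Composition</H2>
<A href="/generics/levocetrizine-210129">Name 1</A>,
<A href="/generics/paracetamol-210459">Name 2(500 mg)</A>
'@

$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($content)

# Anchors that follow the Composition heading and link to /generics/.
$xpath = "//h2[contains(text(), 'Composition')]/following-sibling::a[contains(@href, '/generics/')]"

$nodes = $doc.DocumentNode.SelectNodes($xpath)
if ($nodes) {   # SelectNodes returns $null when nothing matches
    foreach ($node in $nodes) {
        [pscustomobject]@{
            Name = $node.InnerText
            Uri  = $node.GetAttributeValue('href', '')
        }
    }
}
```

`GetAttributeValue` with a default is used here instead of indexing `Attributes` so that an anchor without an `href` does not throw.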

Using regex is going to be a pain in the long run; it is not meant for parsing HTML or other structured documents such as XML.

If you really want to take the awful, bad, not good, horrible regex way, try something like this:

(?is)<h2>composition</h2>(?:(?:(?!<a).)*<a href="/generics/[^"]+">(?<name>(?:(?!</a).)*)</a>)*

And use the .NET regex capability to grab the Captures:

([regex]$regex).Match("$content").Groups['name'].Captures
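
To make the Captures idea concrete, here is a minimal, self-contained sketch run against the sample from the question. The variable names `$content`, `$pattern` and `$names` are illustrative only. The pattern used here enables `(?s)` so that `.` can cross line breaks, and it applies the negative lookahead one character at a time, so each repetition stops at the next `<a` tag; treat it as one possible variant rather than the definitive pattern.

```powershell
# Sample input copied from the question.
$content = @'
<H2>Composition</H2>
<A href="/generics/levocetrizine-210129">Name 1</A>,
<A href="/generics/paracetamol-210459">Name 2(500 mg)</A>,
<A href="/generics/phenylephrine-hydrochloride-210494">Name 3</A>,
<A href="/generics/ambroxol-hydrochloride-211798">Name 4</A></DIV></DIV></DIV></DIV>
'@

# (?i) ignores case, (?s) lets '.' match newlines.
# (?:(?!<a).)* consumes one character at a time and stops before the next <a tag.
$pattern = '(?is)<h2>composition</h2>(?:(?:(?!<a).)*<a href="/generics/[^"]+">(?<name>(?:(?!</a).)*)</a>)*'

$match = [regex]::Match($content, $pattern)

# A repeated group keeps every value it matched in its Captures collection;
# Groups['name'].Value alone would only give the last one.
$names = @($match.Groups['name'].Captures | ForEach-Object { $_.Value })
$names
```

For the sample above this lists Name 1, Name 2(500 mg), Name 3 and Name 4. The single `Match` call is enough because the whole list is consumed by one match; the individual names come from the group's capture history, which is a .NET-specific feature.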
Community
jessehouwing