-3

I need to extract links and images from HTML pages with C++ and regex, and I came up with this expression:

<\s*(a.*?href|img.*?src)\s*=\s*\"(.*?)\".*?\s*> 

Unfortunately, this also finds links and images inside comments, which it shouldn't. I tried some negative lookaheads without success.

halfer
Doodle
  • 4
    please read this once: https://stackoverflow.com/a/1732454/2815219 – Raman Sahasi Jul 22 '17 at 18:17
  • I need to extract all links and images from websites for a webcrawler project for my university. <\s*(a.*?href|img.*?src)\s*=\s*\"(.*?)\".*?\s*> extracts all links and images, but we shouldn't get those within comments. For example, this regex will find which it should as well as which it shouldn't – Doodle Jul 22 '17 at 18:18
  • 1
    Don't use regex for that. Use a proper HTML parser. – Jesper Juhl Jul 22 '17 at 18:22
  • Unfortunately, we are not allowed to use an HTML parser – Doodle Jul 22 '17 at 18:25
  • Why can you not use an HTML parser? – halfer Jul 22 '17 at 19:17
  • 2
    That's an insane requirement. Parsing general HTML is not a suitable job for a regex. My suggestion is to use a regex to remove HTML comments and CDATA sections and then search, but I'm sure that won't handle all the cases. Note that links can be surrounded by single quotes as well as double. I'm sure I've forgotten some other gotchas. – Martin Bonner supports Monica Jul 22 '17 at 19:23
  • @Casimir: possibly, though academic institutions are rather known for placing entirely unrealistic or daft limitations on assignments, such that they become rather poor examples of how to best solve the problem `:o)`. – halfer Jul 22 '17 at 21:14
  • @halfer: there are indeed a lot of pedagogical materials in books/tutorials and elsewhere that choose HTML as a training ground (clearly due to a lack of imagination). It's sad, because there are a lot of *real life* and more useful possible examples. But it isn't only a regex problem: think about OOP or database tutorials with unrealistic examples about cars with number of doors, colors, speed... Authors speak like lords to the peasants of the Middle Ages. – Casimir et Hippolyte Jul 22 '17 at 21:22

1 Answer

0

There's no reason to do everything at once. Also, you didn't say what environment/editor/programming language, so I picked my favorite, C#.

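  1. Remove all comments, using:

// Needs: using System.Text.RegularExpressions;
// Regex.Replace treats the pattern as a regex (string.Replace would look for it literally);
// RegexOptions.Singleline lets . match newlines, so multi-line comments are removed too.
var s1 = Regex.Replace(source, "<!--.*?-->", "", RegexOptions.Singleline);

  2. Extract links with your existing regex, using:

var s2 = Regex.Matches(s1, "<\\s*(a.*?href|img.*?src)\\s*=\\s*\"(.*?)\".*?\\s*>");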
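
Since the question asks for C++, the same two steps might look roughly like the sketch below with `std::regex`. The function names `strip_comments` and `extract_urls` are made up for illustration. The link pattern is a variation on the asker's regex, not the original: it assumes attributes never contain `>`, uses `[^>]*?` so a match cannot run past the end of a tag, and accepts single as well as double quotes (as noted in the comments). It is not a general HTML parser and will still miss cases such as unquoted attribute values or comment-like text inside scripts.

```cpp
#include <regex>
#include <string>
#include <vector>

// Step 1: strip HTML comments.
// std::regex's ECMAScript grammar has no "dot matches newline" option,
// so [\s\S] is used to let a comment span several lines.
std::string strip_comments(const std::string& html) {
    static const std::regex comment(R"(<!--[\s\S]*?-->)");
    return std::regex_replace(html, comment, "");
}

// Step 2: collect the href values of <a> tags and the src values of <img> tags.
// Group 1 captures the opening quote, group 2 the URL, and \1 requires the
// matching closing quote.
std::vector<std::string> extract_urls(const std::string& html) {
    static const std::regex link(
        R"re(<\s*(?:a\b[^>]*?\bhref|img\b[^>]*?\bsrc)\s*=\s*(["'])(.*?)\1)re",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), link), end;
         it != end; ++it) {
        urls.push_back((*it)[2].str());
    }
    return urls;
}
```

Calling `extract_urls(strip_comments(page))` on a page string runs both steps in order, so links that only appear inside comments are gone before the second regex sees the text.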
NetMage