I'm trying my hand at Scala regex to find the img src attributes in a web page. With the following code and mock content, I don't get any matches. What am I missing?

def imgSrc(content: String) = {
  val src = ".*<img[\\w\\s]+src\\s*=\\s*(\"\\w+\")[\\w\\s]+/>.*".r
  val formattedContent = content.replaceAll(lineSeparator, "")

  (src findAllIn formattedContent).toList
}

Test case:

"Method imgSrc" should "find src attributes of all img tags in mock web page" in {
  val content = """<a href="#search" onclick="_gaq.push(['_trackPageview', '/search']); 
                    return Manager.createHistoryAndLoad(true);">
                    <img src="ajaxsolr/images/centralRepository_logo.png" alt="The Central Repository" />
                  </a>"""
  imgSrc(content) should contain("ajaxsolr/images/centralRepository_logo.png")
}

Also, it'd be nice to be able to match the multiline input without removing the newlines. I read this and this but couldn't get it to work.

Note: This is just a learning exercise. I'm aware and generally agree that one shouldn't use regex to parse HTML.

Abhijit Sarkar

1 Answer

This works on your input:

scala> def imgSrc(content: String) = {
     |   val src = """(?s)<img\s[^>]*?src\s*=\s*['\"]([^'\"]*?)['\"][^>]*?>""".r
     |   src.findAllMatchIn(content).map(_.group(1)).toList
     | }
imgSrc: (content: String)List[String]

scala> imgSrc(content)
res13: List[String] = List(ajaxsolr/images/centralRepository_logo.png)
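
On the multiline part of the question, here is a rough sketch of why the newlines do not need to be stripped. It assumes plain `scala.util.matching.Regex`; the object name `MultilineImgSrc` and the sample strings are made up for illustration. A negated character class such as `[^>]` already matches line breaks, whereas `.` only does so under the `(?s)` flag, so the pattern above should behave the same with or without `(?s)` because it contains no `.`.

```scala
object MultilineImgSrc {
  // The answer's pattern minus the (?s) flag; [^>] matches any character
  // except '>', line breaks included.
  private val ImgSrc = """<img\s[^>]*?src\s*=\s*['"]([^'"]*?)['"][^>]*?>""".r

  def imgSrc(content: String): List[String] =
    ImgSrc.findAllMatchIn(content).map(_.group(1)).toList

  def main(args: Array[String]): Unit = {
    // Hypothetical input: an img tag whose attributes span two lines.
    val html = "<p>\n<img alt=\"logo\"\n     src=\"a/b.png\" />\n</p>"
    println(imgSrc(html))

    // '.' stops at line breaks unless the (?s) (DOTALL) flag is set.
    println("a.b".r.findFirstIn("a\nb"))
    println("(?s)a.b".r.findFirstIn("a\nb"))
  }
}
```

The original pattern in the question most likely fails for a different reason: `\w+` cannot match the `/` and `.` in the path, and `[\w\s]+` cannot match the `=` and quotes of the `alt` attribute.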

That said, I would recommend using a proper HTML parser such as Jsoup:

 import org.jsoup.Jsoup
 // Parse the HTML, then read the src attribute of the first img element
 val doc = Jsoup.parse(content)
 val img = doc.select("img").first()
 val src = img.attr("src")
dk14
  • This gets the whole image tag, not the src, which is what I want. I'm aware of jsoup; this is just a learning exercise. – Abhijit Sarkar May 04 '15 at 05:21
  • Thank you, I've accepted and upvoted your answer. I'll have to look into the `findAllMatchIn` method. Can I not get the groups from `findAllIn matchData`? – Abhijit Sarkar May 04 '15 at 05:26
  • `src findAllMatchIn content` and `(src findAllIn content).matchData` return the same `Iterator[Match]`, but the first one reads more naturally in Scala – dk14 May 04 '15 at 05:32
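
To illustrate the last comment, here is a minimal sketch comparing the two calls. The object name `MatchDataDemo`, the simplified pattern, and the sample HTML are assumptions made for this example, not part of the answer. It also shows that `findAllIn` on its own yields only the whole matched strings, which is why the capture group has to be read from a `Match`.

```scala
import scala.util.matching.Regex

object MatchDataDemo {
  // Simplified, hypothetical pattern: captures the quoted value of a src attribute.
  val Src: Regex = """src\s*=\s*['"]([^'"]*)['"]""".r

  // Group 1 of every match, via findAllMatchIn.
  def viaMatchIn(html: String): List[String] =
    Src.findAllMatchIn(html).map(_.group(1)).toList

  // The same result, via the MatchIterator returned by findAllIn.
  def viaMatchData(html: String): List[String] =
    Src.findAllIn(html).matchData.map(_.group(1)).toList

  // findAllIn iterated directly yields the full matched text, not the group.
  def wholeMatches(html: String): List[String] =
    Src.findAllIn(html).toList

  def main(args: Array[String]): Unit = {
    val html = """<img src="a.png"><img src='b.png'>"""
    println(viaMatchIn(html))
    println(viaMatchData(html))
    println(wholeMatches(html))
  }
}
```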