I don't understand how to use pattern matching for two or more regular expressions. For instance, I wrote the following program:

import scala.io.Source.{fromInputStream}
import java.io._
import java.net._
object craw
{
  def main(args: Array[String])
  {
    val url=new URL("http://contentexplore.com/iphone-6-amazing-looks/")
    val content=fromInputStream(url.openStream).getLines.mkString("\n")
    val x="<a href=(\"[^\"]*\")[^<]".r.
      findAllIn(content).
      toList.
      map(x=>x.substring(16,x.length()-2)).
      mkString("").
      split("/").
      mkString("").
      split(".com").
      mkString("").
      split("www.").
      mkString("").
      split(".html").
      toList
    print(x)
  }
}

The above reads in all the anchor tags.

import scala.io.Source.{fromInputStream}
import java.io._
import java.net._
object new1
{
  def main(args: Array[String])
  {
    val url=new URL("http://contentexplore.com/iphone-6-amazing-looks/")
    val content=fromInputStream(url.openStream).getLines.mkString("\n")
    val x="<p>.*?</p>".r.
      findAllIn(content).
      toList.
      map(x=>x.substring(3,x.length()-4)).
      mkString("").
      split("</strong>").
      mkString("").
      split("</em>").
      mkString("").
      split(";").
      mkString("").
      split("<em>").
      mkString("").
      split("<strong>").
      mkString("").
      split("&nbsp").
      toList
    print(x)
  }
}

The above reads in all the paragraph tags.

I want to combine these two regular expressions into a single program, using pattern matching. Can someone guide me on how to use two or more regular expressions together?

NOTE: This question is about combining regular expressions, not about how to parse HTML efficiently.

Nathaniel Ford
shashank
  • [You shouldn't parse HTML with regex](http://stackoverflow.com/a/1732454/406435). Instead, you could use an [HTML parser](http://blog.dub.podval.org/2010/08/scala-and-tag-soup.html). – senia May 26 '13 at 04:55
  • Thank you so much, senia, but as I am a beginner I am unable to follow the example you have given. Please suggest how to write more than one regular expression in pattern matching. For instance, I want one case that retrieves all the anchor tags and another that retrieves the paragraph tags. Thank you. – shashank May 26 '13 at 05:32
  • Again: don't parse HTML with regex. I found jsoup (http://jsoup.org/cookbook/) to be very simple to use. It only requires that you know how to port examples from Java to Scala, which is simple in this case. – Johnny Everson May 26 '13 at 13:12

1 Answer

As noted in the comments, using regex to parse HTML is not recommended (nor is any other hand-rolled technique, unless you are sure you can't or don't want to use an existing parser such as jsoup).

For educational purposes, here is one way to chain regexes with pattern matching (using the regexes as extractors):

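val LinkPattern = "<a href=(\"[^\"]*\")[^<]".r
// The capture group is needed so that ParagraphPattern(d) has something to bind
val ParagraphPattern = "<p>(.*?)</p>".r
xmlNodeString match {
   case LinkPattern(c) => // c is bound to the quoted href value
   case ParagraphPattern(d) => // d is bound to the paragraph body
   case _ => // neither pattern matched the whole string
}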

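Note: this assumes that xmlNodeString holds one individual node at a time, so you would need to traverse the nodes and match each one in turn. A regex used as an extractor must match the entire string, unless you call .unanchored on it, and it needs one capture group per variable bound in the case.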
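To tie this back to the question, here is a minimal sketch of one way to run both regexes over a whole page in a single pass. It is not the only approach. The object name CombinedPatterns, the Fragment types, and the extract helper are made up for illustration, and the paragraph regex has a capture group added so it can bind the paragraph body. The two regexes are joined with | to find fragments in document order, and each fragment is then classified with the extractors.

```scala
import scala.util.matching.Regex

object CombinedPatterns {
  // Hypothetical result types, introduced only for this example.
  sealed trait Fragment
  final case class Link(href: String) extends Fragment
  final case class Paragraph(body: String) extends Fragment

  // Regex sources: the link regex is the one from the question;
  // the paragraph regex has a capture group added (an assumption).
  private val linkSrc = "<a href=(\"[^\"]*\")[^<]"
  private val paraSrc = "<p>(.*?)</p>"

  val LinkPattern: Regex = linkSrc.r
  val ParagraphPattern: Regex = paraSrc.r

  // Alternation of both regexes, used only to locate fragments in order.
  val Combined: Regex = (linkSrc + "|" + paraSrc).r

  // Find every fragment, then let the extractors decide which kind it is.
  // Each fragment is exactly one regex match, so the anchored extractors apply.
  def extract(content: String): List[Fragment] =
    Combined.findAllIn(content).toList.collect {
      case LinkPattern(href)      => Link(href)      // href keeps its quotes
      case ParagraphPattern(body) => Paragraph(body)
    }
}
```

Calling CombinedPatterns.extract on a string such as "<p>Hello</p><a href=\"http://example.com/a\">x</a>" should yield a Paragraph followed by a Link. Because . does not match newlines by default, paragraphs that span several lines are skipped by this sketch.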

Johnny Everson