
I have a data set in the following pattern:

1<a href="/contact/">Joe</a><br />joe.doe@somemail.com</div>
2<a href="/contact/">Tom</a><br />tom.cat@aol.com</div>
3<a href="/contact/">Jerry</a><br />jerry.mouse@yahoo.co.in</div>

And so on.

I need to extract only the name and email address from each entry. How do I do it?


Update:

Based on your responses, I've changed my data format to:

1(name)Joe(email)joe.doe@somemail.com(end)
2(name)Tom(email)tom.cat@aol.com(end)
3(name)Jerry(email)jerry.mouse@yahoo.co.in(end)

How do I parse that?

Ragunath Jawahar

3 Answers


Don't use regular expressions to parse HTML.

Use an HTML parser. There are a bunch listed on this page. Based on my experience using Tidy, I would suggest JTidy. From their page:

JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

UPDATE

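Based on the edit to your question, use split() to split the string with \([a-z]+\) as the delimiter. This gives you the separate components: the text before the first marker (the leading number in your data), then the name, then the email address: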

String[] components = str.split("\\([a-z]+\\)");

Or you could use the more generic expression \(.*?\).
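
To make the indexing concrete, here is a minimal sketch of the split approach. It assumes every record has exactly the (name)...(email)...(end) shape from the question; the class and method names (SplitDemo, parse) are made up for illustration.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits one record on its "(tag)" markers and returns {name, email}.
    // Assumes the record shape from the question: prefix, (name), name, (email), email, (end).
    static String[] parse(String record) {
        String[] components = record.split("\\([a-z]+\\)");
        // components[0] is whatever precedes "(name)" (the leading number here).
        // split() drops trailing empty strings, so "(end)" adds no extra element.
        if (components.length < 3) {
            throw new IllegalArgumentException("Unexpected record: " + record);
        }
        return new String[] { components[1], components[2] };
    }

    public static void main(String[] args) {
        String[] parsed = parse("1(name)Joe(email)joe.doe@somemail.com(end)");
        System.out.println(Arrays.toString(parsed)); // [Joe, joe.doe@somemail.com]
    }
}
```

The length check is only a guard for malformed lines; if your data can contain parentheses inside a name, splitting like this will not be enough.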

Vivin Paliath

Use this regex:

\(name\)(.*)\(email\)(.*)\(end\)

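Now the first capturing group contains the name and the second contains the email address. In Java you read them with Matcher.group(1) and Matcher.group(2); \1 and \2 are the backreference syntax used inside a pattern or in some other tools. Because . does not match line breaks by default, the greedy (.*) is safe as long as each record sits on its own line.

Keep calling Matcher.find() with the same pattern to get the next name and email address.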
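
A rough sketch of that loop in Java, in case it helps. Note that it swaps the greedy (.*) for the reluctant (.*?) so that it still works if several records end up on one line; that change, and the names ContactRegexDemo and extract, are my own assumptions rather than part of the regex above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContactRegexDemo {
    // Reluctant (.*?) keeps each match inside a single record.
    private static final Pattern RECORD =
            Pattern.compile("\\(name\\)(.*?)\\(email\\)(.*?)\\(end\\)");

    // Returns one {name, email} pair per record found in the input.
    static List<String[]> extract(String data) {
        List<String[]> contacts = new ArrayList<>();
        Matcher m = RECORD.matcher(data);
        while (m.find()) { // each find() advances to the next record
            contacts.add(new String[] { m.group(1), m.group(2) });
        }
        return contacts;
    }

    public static void main(String[] args) {
        String data = "1(name)Joe(email)joe.doe@somemail.com(end)\n"
                    + "2(name)Tom(email)tom.cat@aol.com(end)";
        for (String[] c : extract(data)) {
            System.out.println(c[0] + " <" + c[1] + ">");
        }
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex for every line.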

Chetan

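If you are guaranteed that this will be the standard pattern for all of your entries, you can simply use String.split() on each line, using the regular expression \(.*?\) as the split pattern. This matches a ( followed by the fewest possible other characters, followed by a ). Note that the first element of the result is whatever comes before the first marker (the leading number in your data), so the name and email are the second and third elements. So the code looks something like this:

// for each String line
String[] items = line.split("\\(.*?\\)");
// items[0] is the text before the first "(...)" marker
String name = items[1];
String email = items[2];
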
Zoe