Parse text email addresses using XPath, NOT //A[startswith(@href, 'mailto:')]

Question

I want to extract email addresses from a few different websites. If they are in active link format, I can do this using:

//A[starts-with(@href, 'mailto:')]

But some of them are plain text, e.g. example@domain.com, not a link, so I would like to select a path to element that contains @ inside.

See http://stackoverflow.com/questions/535600/ruby-email-check-rfc-2822 — Reactormonk, Apr 11 '12 at 09:01
I have no idea what to use. I've tried everything that came to my mind, but nothing worked. – Gargamel, Apr 11 '12 at 09:23

Dimitre Novatchev · Accepted Answer · 2012-04-11T13:20:16.273

I would like to select a path to element that contains @ inside

Use:

//*[contains(., '@')]

It seems to me that what you actually wanted is to select elements that have a text-node child that contains "@". If this is so, use:

//*[contains(text(), '@')]

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
     <xsl:copy-of select="//*[contains(text(), '@')]"/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<html>
 <body>
  <a href="xxx.com">xxx.com</a>
  <span>someone@xxx.com</span>
 </body>
</html>

the XPath expression is evaluated and the selected nodes are copied to the output:

<span>someone@xxx.com</span>

edited Apr 11 '12 at 13:20

answered Apr 11 '12 at 12:30

Dimitre Novatchev


//*[contains(., '@')] selects everything on a website – Gargamel Apr 11 '12 at 12:41

@Gargamel: No, it only selects element nodes, whose string value contains the "@" character -- this is what you said you wanted... – Dimitre Novatchev Apr 11 '12 at 12:48
When I use, for example, "//p[contains(., '@')]", the selection contains everything from the p tag ("Mail: example@gmail.com Skype: loginskype"); when I use //*[contains(., '@')], the selection contains everything from the website. – Gargamel Apr 11 '12 at 12:59
@Gargamel: Yes, but this is what you wanted. Try: `//*[contains(text(), '@')]` -- I edited the answer. – Dimitre Novatchev Apr 11 '12 at 13:02
Unfortunately no difference, still selecting everything on a website. – Gargamel Apr 11 '12 at 13:09
@Gargamel: That means that *every* element has a text-node child that contains "@". If you edit the question and provide a small XML document on which the XPath expression selects every element, we will clearly see that, if really this is the case, then every element in this document has a text-node-child containing "@". – Dimitre Novatchev Apr 11 '12 at 13:15
@Gargamel: See the update to the answer -- only a single element is selected. – Dimitre Novatchev Apr 11 '12 at 13:21
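
The disagreement in these comments comes down to string values: the string value of an element includes the text of all its descendants, so every ancestor of the address also matches contains(., '@'). Here is a minimal Ruby sketch of that, run against the sample document from the answer. It assumes the input is well-formed XML and uses REXML from Ruby's standard distribution; the thread does not name a parser, and selected_names is a hypothetical helper.

```ruby
require "rexml/document"

# Sample page copied from the answer above.
SAMPLE_HTML = <<~XML
  <html>
   <body>
    <a href="xxx.com">xxx.com</a>
    <span>someone@xxx.com</span>
   </body>
  </html>
XML

# Hypothetical helper: returns the names of the elements that an
# XPath expression selects in the given (well-formed) document.
def selected_names(xml, xpath)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, xpath).map(&:name)
end

# Tests the string value of each element (all descendant text).
puts selected_names(SAMPLE_HTML, "//*[contains(., '@')]").inspect

# Tests only the element's own text-node child.
puts selected_names(SAMPLE_HTML, "//*[contains(text(), '@')]").inspect
```

Under XPath 1.0 rules the first expression should report html, body and span, while the second should report only span. Note also that contains(text(), '@') looks only at the first text-node child of each element; //*[text()[contains(., '@')]] checks every text-node child. Real-world HTML is often not well-formed, so an HTML-aware parser may be needed in practice.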

score 4 · Answer 2 · answered Apr 11 '12 at 12:20

You'll probably want to use a regular expression. They'll allow you to extract the email addresses, regardless of their context within a document. Here is a little test-driven example to get you started:

require "minitest/spec"
require "minitest/autorun"

module Extractor
  EMAIL_REGEX = /[\w]+@[\w]+\.[\w]+/

  def self.emails(document)
    (matches = document.scan(EMAIL_REGEX)).any? ? matches : false
  end
end

describe "Extractor" do
  it 'should extract an email address from plaintext' do
    emails = Extractor.emails("email@example.com")
    emails.must_include "email@example.com"
  end

  it 'should extract multiple email addresses from plaintext' do
    emails = Extractor.emails("email@example.com and email2@example2.com")
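    # must_include takes one expected value; a second argument is treated as the failure message
    emails.must_include "email@example.com"
    emails.must_include "email2@example2.com"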
  end

  it 'should extract an email address from the href attribute of an anchor' do
    emails = Extractor.emails("<a href='mailto:email3@example3.com'>Email!</a>")
    emails.must_include "email3@example3.com"
  end

  it 'should extract multiple email addresses from both plaintext and within HTML' do
    emails = Extractor.emails("my@email.com OR <a href='mailto:email4@example4.com'>Email!</a>")
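    # must_include takes one expected value; a second argument is treated as the failure message
    emails.must_include "email4@example4.com"
    emails.must_include "my@email.com"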
  end

  it 'should not extract an email address if there isn\'t one' do
    emails = Extractor.emails("email(at)address(dot)com")
    emails.must_equal false
  end

  it "should extract email addresses" do
    emails = Extractor.emails("email.address@domain.co.uk")
    emails.must_include "email.address@domain.co.uk"
  end
end

The last test fails because the regular expression only matches addresses of the form word@word.word, so it truncates dotted local parts and multi-label domains such as .co.uk. See if you can use this as a starting point to come up with, or find, a better regular expression. To help build your regular expressions, check out Rubular.
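
As one possible starting point for that better expression, here is a sketch that allows dots in the local part and any number of dot-separated domain labels. The module name LooseExtractor and the pattern itself are illustrative assumptions, not part of the answer, and the pattern is a loose matcher rather than an RFC-grade validator.

```ruby
# Hypothetical variant of the Extractor module above.
module LooseExtractor
  # Local part: word characters plus . % + -
  # Domain: one label, then one or more ".label" groups.
  # The group is non-capturing so String#scan returns whole matches.
  EMAIL_REGEX = /[\w.%+-]+@[\w-]+(?:\.[\w-]+)+/

  # Returns every match as an array of strings (empty if none),
  # unlike Extractor.emails above, which returns false on no match.
  def self.emails(document)
    document.scan(EMAIL_REGEX)
  end
end

puts LooseExtractor.emails("email.address@domain.co.uk").inspect
puts LooseExtractor.emails("Write to my@email.com.").inspect
```

Because every dot in the domain must be followed by at least one label character, a sentence-ending period after an address is left out of the match. Returning an empty array instead of false is a design choice here, so the spec for the "no address" case would need to expect an empty result.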

This is a really good answer. Here's a short one that passes your tests: /[\w._%+-]+@[\w._%+-]+\.[\w._%+-]+/ — pguardiario, Apr 11 '12 at 14:08
