0

i have a html code something like this. I'm using Node.js for web scraping.

<div id="content_column">

<CENTER>01/23/2014</CENTER>
<BR> <B>Name : </B> GLUCK MARTIN <BR> <B>Address : </B> <BR>
<B>Profession : </B> MEDICINE <BR> <B>License No: </B> 077798 <BR>
<B>Date of Licensure : </B> 05/05/56 <BR> <B>Additional
    Qualification : </B> &nbsp; <BR> <B> <A
    href="http://www.op.nysed.gov/help.htm#status"> Status :</A></B> DECEASED
11/24/13 <BR> <B>Registered through last day of : </B> <BR> <B>Medical
    School: </B> UNIVERSITY OF GENEVA <B>&nbsp;&nbsp;&nbsp; Degree Date :
</B> Not on file <BR>
<HR>
<div class="note">
    (Use your browser's back key to return to licensee list.)<BR> <BR>
    * Use of this online verification service signifies that you have read
    and agree to the <A href="http://www.op.nysed.gov/usage.htm">terms
        and conditions of use</A>. See <A href="http://www.op.nysed.gov/help.htm">HELP
        glossary</A> for further explanations of terms used on this page. <BR>
    <BR> <B>Note: </B> The Board of Regents does not discipline <i>physicians(medicine),
        physician assistants,</i> or <i>specialist assistants.</i> The status of
    individuals in these professions may be impacted by information
    provided by the NYS Department of Health. To search for the latest
    discipline actions against individuals in these professions, please
    check the New York State Department of Health's <A
        href="http://www.health.state.ny.us/nysdoh/opmc/main.htm"> Office
        of Professional Medical Conduct</A> homepage.
    </UL>
</div>
<HR>
<div class="note">
    Further information on physicians may be found on the following
    external sites (The State Education Department is not responsible for
    the accuracy or completeness of information located on external
    Internet addresses.): <BR> <BR> <a
        href="http://www.abms.org/">American Board of Medical Specialties</a>
    <BR> <BR> <a href="http://www.ama-assn.org/">American
        Medical Association:</a> <BR> - For the general public: <a
        href="http://www.ama-assn.org/aps/amahg.htm">AMA Physician
        Select, On-line Doctor Finder</a><BR>&nbsp;&nbsp;&nbsp; <BR> -
    For organizations that verify physician credentials: <a
        href="http://www.ama-assn.org/physdata/physrel/physrel.htm">AMA
        Physician Profiles</a> <BR> <BR> <a
        href="http://www.aoa-net.org/">American Osteopathic Association,
        AOA-Net</a> <BR> <BR> <a href="http://www.docboard.org/">Association
        of State Medical Board Executive Directors-(A.I.M."DOCFINDER")</a> <BR>
    <BR> <a href="http://www.nydoctorprofile.com/welcome.jsp">New
        York State Department of Health Physician Profiles</a><BR> <BR>The
    following sites provide additional information concerning the medical
    profession: <BR> <BR> <a href="http://www.clearhq.org/">CLEAR
        (Council on Licensure, Enforcement and Regulation)</a> <BR> <BR>
    <a href="http://www.fsmb.org/">Federation of State Medical Boards</a><BR>
    <BR>
</div>
<CENTER>
    <BR> <IMG SRC="http://www.op.nysed.gov/Sedseal.jpg" WIDTH="100"
        HEIGHT="101" ALT="Seal of the State Education Department"><BR>
    <BR>
</CENTER>
</div>

how can I find those values that are not within any element, in this case they are GLUCK MARTIN, MEDICINE ,077798 ,05/05/56, and so on.

Peter Huang
  • 972
  • 4
  • 12
  • 34

3 Answers3

0

It's easy with jQuery - combination of not and contains :

$("#content_column").not(":contains('GLUCK MARTIN')")
Goran.it
  • 5,991
  • 2
  • 23
  • 25
  • Unless I misunderstood I don't think he/she is asking how to find elements that don't contain the phrase "GLUCK MARTIN." I believe he/she is asking how to find text that is not surrounded by any tags. – krispy Jan 23 '14 at 17:09
  • In that case he/she would give code without the quoted string to search for I believe, I maybe misunderstood the question (not sure 100%). – Goran.it Jan 23 '14 at 17:12
0

Refer to this answer:

$('#content_column').clone().children().remove().end().text()

Here is a fiddle with your example markup.

Community
  • 1
  • 1
Goose
  • 3,241
  • 4
  • 25
  • 37
0

In node I'd recommend doing scrape work like this with a DOM instead of something like regex. jsdom is a good one that will allow you to build a DOM out of your fragment. From there, you can query document.documentElement (in my example I'll be using jquery) and pull out any direct text-nodes not wrapped in a tag.

// Count all of the text not in a tag
var jsdom = require("jsdom");

jsdom.env(
  "URL OR YOUR HTML STRING HERE",
  ["http://code.jquery.com/jquery.js"],
  function (errors, window) {
    var textNodes = window.$(window.document.documentElement)
        .find(":not(iframe)")
        .addBack()
        .contents()
        .filter(function() {
            return this.nodeType == 3;
        });
    //do something with textNodes
  }
);
PP_vinnieg
  • 224
  • 3
  • 6