
I have extracted a lot of forum posts into separate txt files, which are now on my hard drive. Each file contains information I would like to extract, and I have already figured out how to extract some of it. The information I still need is in the following form:

Within the same "html block"

1: (x) messages in this thread
2: Message is in reply to (some html code) A HREF="link" (some html code)

In task 1 I simply need to extract x.
In task 2 I need to extract the links that the message is in reply to.

I have looked into the tm and XML packages but have not been able to work out which one to use or how. Any advice is appreciated.

This is what one of the txt files looks like

`<HTML>
<HEAD>
<TITLE>Dear LEGO : 5668 </TITLE>
<META NAME="ROBOTS" CONTENT="ALL, INDEX, FOLLOW">
<META NAME="KEYWORDS" CONTENT="lego, legos, legoland, toy, construction, community, education, technic, mindstorms, toolo, duplo, primo, dacta">
<META NAME="DESCRIPTION" CONTENT="Dear LEGO : 5668 - LUGNET: The international fan-created LEGOÆ Users Group Network. A place for LEGOÆ fans of all ages to find information, meet one another, and share ideas. As an independent site by fans, for fans, it is neither sponsored nor endorsed by the LEGO Company.">
<SCRIPT LANGUAGE="JavaScript" SRC="http://www.lugnet.com/js/common.js"></SCRIPT>
</HEAD>

<BODY
 LEFTMARGIN=0 TOPMARGIN=0 MARGINWIDTH=0 MARGINHEIGHT=0
 BGCOLOR="#FFFFFF" TEXT="#000000" xLINK="#0000FF" xVLINK="#501080" xALINK="#B0C8EC">  <TABLE BORDER=0 CELLPADDING=9 CELLSPACING=0 WIDTH="100%" BGCOLOR="#B0C8EC">
  <TR ALIGN=CENTER VALIGN=BOTTOM>

    <TD ALIGN=LEFT><NOBR><A TARGET="_top" HREF="http://www.lugnet.com/"><IMG BORDER=0 WIDTH=28 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-home.gif" ALT="To LUGNET Homepage"></A><A TARGET="_top" HREF="http://news.lugnet.com/"><IMG BORDER=0 WIDTH=27 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-news.gif" ALT="To LUGNET News Homepage"></A><A TARGET="_top" HREF="http://guide.lugnet.com/"><IMG BORDER=0 WIDTH=37 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-guide.gif" ALT="To LUGNET Guide Homepage"></A></NOBR><BR></TD>       <FORM NAME="search" ACTION="http://www.lugnet.com/search.cgi" METHOD=POST
       onSubmit="return(MetaSearch(document.search))">  <TD>
        <INPUT TYPE=HIDDEN NAME="category" VALUE="/dear-lego/">
        <NOBR><SELECT NAME="scope">
          <OPTION VALUE="SetGuide">Set Reference
          <OPTION VALUE="QuickSet">Set Reference (Popup)
          <OPTION VALUE="PartsRef">Parts Reference  <OPTION VALUE="News">News
          <OPTION VALUE="NewsRel" SELECTED>News (Dear LEGO)         </SELECT>&nbsp;<A HREF="http://www.lugnet.com/help/search/"><IMG BORDER=0 WIDTH=16 HEIGHT=16 HSPACE=0 VSPACE=0 SRC="http://www.lugnet.com/help/help.gif" ALT="Help on Searching"></A></NOBR><BR>  <NOBR><INPUT TYPE=TEXT NAME="query" VALUE="" SIZE=16 MAXLENGTH=200><SMALL>&nbsp;<INPUT TYPE=SUBMIT NAME="SUBMIT" VALUE="Search"></SMALL></NOBR><BR>
      </TD>
      </FORM> 

    <TD ALIGN=RIGHT><NOBR><A HREF="/news/post/?lugnet.dear-lego"><IMG BORDER=0 WIDTH=22 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-post.gif" ALT="Post new message to lugnet.dear-lego"></A><A HREF="news://lugnet.com/lugnet.dear-lego"><IMG BORDER=0 WIDTH=30 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-nntp.gif" ALT="Open lugnet.dear-lego in your NNTP Newsreader"></A><A HREF="http://news.lugnet.com/news/traffic/"><IMG BORDER=0 WIDTH=32 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-traffic.gif" ALT="To LUGNET News Traffic Page"></A><IMG BORDER=0 WIDTH=3 HEIGHT=44 HSPACE=6 VSPACE=0 SRC="/news/icon-sep.gif"><A HREF="http://www.lugnet.com/people/members/sign-in/"><IMG BORDER=0 WIDTH=37 HEIGHT=44 HSPACE=0 VSPACE=0 SRC="/news/icon-signin-key.gif" ALT="Sign In (Members)"></A></NOBR><BR></TD>

  </TR> 
</TABLE>
<TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%" BGCOLOR="#8899BB"><TR><TD><SPACER TYPE=BLOCK WIDTH=1 HEIGHT=1></TD></TR></TABLE>  <TABLE BORDER=0 CELLPADDING=7 CELLSPACING=0 WIDTH="100%" BGCOLOR="#E8F0FF"> <TR ALIGN=CENTER VALIGN=CENTER>
        <TD COLSPAN=2 ALIGN=CENTER VALIGN=CENTER>
<script type="text/javascript"><!--
google_ad_client = "pub-0089902038208374";
//LUGNET 728x15, Erstellt 13.12.07
google_ad_slot = "6645292597";
google_ad_width = 728;
google_ad_height = 15;
//--></script>
<script type="text/javascript"
src="http://pagead2.googlesyndication.com/pagead/show_ads.js">
</script>
        </TD>
      </TR> <TR ALIGN=LEFT VALIGN=CENTER>  <TD>  <BIG><FONT FACE="Geneva,Arial,Helvetica">
        &nbsp;<A HREF="/dear-lego/">Dear&nbsp;LEGO</A>&nbsp;<FONT COLOR="#8899BB">/</FONT>  5668  <BR></FONT></BIG>  </TD>  <TD ALIGN=RIGHT><SMALL><FONT FACE="Geneva,Arial,Helvetica">
        <A HREF="/dear-lego/?n=5667">5667</A>&nbsp;&nbsp;|&nbsp;&nbsp;<A HREF="/dear-lego/?n=5669">5669</A>
      <BR></SMALL></FONT></TD>  </TR>

</TABLE>
<TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%" BGCOLOR="#8899BB"><TR><TD><SPACER TYPE=BLOCK WIDTH=1 HEIGHT=1></TD></TR></TABLE>  <!-- google_ad_section_start --> <CENTER>  <TABLE BORDER=0 CELLPADDING=16 CELLSPACING=0 WIDTH="100%"><TR><TD ALIGN=LEFT>    <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0><TR ALIGN=LEFT VALIGN=TOP><TD>  <TABLE BORDER=0 CELLPADDING=8 CELLSPACING=0>

      <TR BGCOLOR="#E0E0E0"><TD ALIGN=LEFT> <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%"><TR ALIGN=CENTER VALIGN=TOP>  <TD ALIGN=LEFT VALIGN=TOP>

    <TABLE BORDER=0 CELLPADDING=2 CELLSPACING=0>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Subject:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><BIG><BIG><B>Online PAB and Design-by-me needs more parts for Lego Train</B></BIG></BIG><BR></FONT></TD>

          </TR>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Author:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><B>Benjamin Medinets</B><BR></FONT></TD>

          </TR>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Newsgroups:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><A HREF="/dear-lego/">lugnet.dear-lego</A>, <A HREF="/trains/">lugnet.trains</A><BR></FONT></TD>

          </TR>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Followup-To:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><A HREF="/trains/">lugnet.trains</A><BR></FONT></TD>

          </TR>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Date:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1">Thu, 6 Oct 2011 03:44:44 GMT<BR></FONT></TD>

          </TR>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">From:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><FONT COLOR="#7070A0">Benjamin Medinets &lt;bmedinets@excite.com+stopspammers+&gt;</FONT><BR></FONT></TD>

          </TR>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Highlighted:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><FONT COLOR="#D57F7F"><B>!</B></FONT> 

<A HREF="/news/ahh.cgi?lugnet.dear-lego,5668">(details)</A><BR></FONT></TD>

          </TR>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Viewed:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1">3013 times<BR></FONT></TD>

          </TR>  </TABLE>

    </TD>  <TD WIDTH=20>&nbsp;&nbsp;</TD>

      <TD ALIGN=CENTER VALIGN=TOP>

      <FONT FACE="Geneva,Arial,Helvetica" SIZE="-2"><A HREF="/news/raw.cgi?lugnet.dear-lego,5668">View Raw<BR>Message</A><BR><BR></FONT>  <A HREF="/news/post/?lugnet.dear-lego,5668"><IMG BORDER=0 WIDTH=30 HEIGHT=44 HSPACE=10 VSPACE=10 SRC="/news/icon-reply.gif" TITLE="Post a public reply to this message"></A><BR>  </TD>  </TR></TABLE> </TD></TR>

      <TR BGCOLOR="#F0F0F0"><TD ALIGN=LEFT NOWRAP><TT>I was using Lego Digital Designer and am disappointed the downhill availabilty<BR>
of certain important parts to build &quot;buyable&quot; models.<BR>
<BR>
I would like to see a return of &quot;warehouse&quot; sliding doors to make<BR>
box cars.<BR>
<BR>
Train-style doors would also be nice as well as train windows (both in<BR>
2x3 and 4x3)... please.<BR>
<BR>
I looked at the instructions to build a mail car from the 7722, and<BR>
found that I really only need 2 red sliding rail doors, the pair of<BR>
&quot;decorated train doors&quot; and a set of two 2x3 thin yellow train<BR>
windows.<BR>
<BR>
Yes, there was a bit of minor substitution but it is mostly distiguishable<BR>
as the model.<BR>
<BR>
Here is what it looks like:<BR>
<BR>
<A HREF="http://www.lugnet.com/jump.cgi?http://www.brickshelf.com/gallery/medib/lego-fun/7722mailvan.jpg">http://www.brickshelf.com/gallery/medib/lego-fun/7722mailvan.jpg</A><BR>
<BR>
Yeah I know... where are the f-in doors???<BR>
<BR>
<BR>
Ben<BR>
</TT>
</TD></TR>

      <TR BGCOLOR="#E0E0E0"><TD ALIGN=LEFT></TD></TR>

    </TABLE> <BR> <BR>  <FONT FACE="Verdana,Geneva,Helvetica" SIZE="-1" COLOR="#990000">



      <B>1 Message in This Thread:</B><BR> <NOBR><IMG WIDTH=9 HEIGHT=11 VSPACE=2 SRC="/news/here.gif" TITLE="You are here"></NOBR><BR><NOBR></NOBR>
 <DL>

      <DT>Entire Thread on One Page:

      <SMALL><FONT COLOR="#000000">

        <DD><B>Nested:&nbsp;</B>

        <A HREF="/dear-lego/?n=5668&t=i&v=a">All</A> | <A HREF="/dear-lego/?n=5668&t=i&v=b">Brief</A> | <A HREF="/dear-lego/?n=5668&t=i&v=c">Compact</A> | <A HREF="/dear-lego/?n=5668&t=i&v=d">Dots</A>

        <BR><B>Linear:&nbsp;</B>

        <A HREF="/dear-lego/?n=5668&t=f&v=a">All</A> | <A HREF="/dear-lego/?n=5668&t=f&v=b">Brief</A> | <A HREF="/dear-lego/?n=5668&t=f&v=c">Compact</A>

      </FONT></SMALL>  </DL>



      </FONT>  </TD>

    <TD WIDTH=20>&nbsp;&nbsp;&nbsp;&nbsp;<BR></TD>

    <TD><FONT FACE="Verdana,Geneva,Arial,Helvetica" SIZE="-1">  
<script type="text/javascript"><!--
google_ad_client = "pub-0089902038208374";
//LUGNET 160x600, Erstellt 14.12.07
google_ad_slot = "5985678701";
google_ad_width = 160;
google_ad_height = 600;
//--></script>
<script type="text/javascript"
src="http://pagead2.googlesyndication.com/pagead/show_ads.js">
</script>
<BR>
<style type="text/css"> @import url(http://www.google.com/cse/api/branding.css);
</style>
<div class="cse-branding-bottom" style="background-color:#FFFFFF;color:#000000">
  <div class="cse-branding-form">
    <form action="http://www.google.com/cse" id="cse-search-box">
      <div>
        <input type="hidden" name="cx" value="partner-pub-0089902038208374:9n7bh3k27mb" />
        <input type="hidden" name="ie" value="ISO-8859-1" />
        <input type="text" name="q" size="31" />
        <input type="submit" name="sa" value="Search" />
      </div>
    </form>
  </div>
  <div class="cse-branding-logo">
    <img src="http://www.google.com/images/poweredby_transparent/poweredby_FFFFFF.gif" alt="Google" />
  </div>
  <div class="cse-branding-text">
    Custom Search
  </div>
</div>  </FONT></TD>

    </TR></TABLE>  <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%">
<TR VALIGN=TOP>  </TR></TABLE>  </TD></TR></TABLE>
  </CENTER>
<!-- google_ad_section_end -->  <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 BGCOLOR="#8899BB" WIDTH="100%"><TR>
<TD><SPACER TYPE=BLOCK WIDTH=1 HEIGHT=1></TD></TR></TABLE>

<TABLE BORDER=0 CELLPADDING=4 CELLSPACING=0 BGCOLOR="#E8F0FF" WIDTH="100%">
  <TR VALIGN=TOP>
    <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" SIZE="-2" COLOR="#000033">  <A HREF="/sitemap.cgi">Newsgroup Tree</A> &nbsp;|&nbsp; <A HREF="http://www.lugnet.com/admin/terms/agreement">Terms of Use</A> &nbsp;|&nbsp; <A HREF="http://www.lugnet.com/admin/feedback/">Feedback</A><BR>
    </FONT></TD>
    <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" SIZE="-2" COLOR="#000033"> &copy;2005 LUGNET. All rights reserved. - hosted by <a href="http://www.steinbruch.info/" target="_blank">steinbruch.info GbR</a><BR>
    </FONT></TD> 
  </TR>
</TABLE>

<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-3258989-12");
pageTracker._initData();
pageTracker._trackPageview();
</script>
</BODY>
</HTML>  `
Kasper Christensen
  • I suggest the XML package. Please provide some example code. – sgibb Sep 08 '12 at 16:14
  • So I am thinking that if one can construct a piece of code that can detect "Message is in Reply To:" and then take the next A HREF="link"...? – Kasper Christensen Sep 08 '12 at 16:32
  • I don't see "messages in this thread" in the html you provided. Please edit your question and add the relevant html code – GSee Sep 08 '12 at 16:58
  • May I suggest you check out these resources: 1. [talkstats.com thread on web scraping (great beginner examples)](http://www.talkstats.com/showthread.php/26153-Still-trying-to-learn-to-scrape) 2. [w3schools.com site on html stuff (very helpful)](http://www.w3schools.com/xpath/xpath_nodes.asp) – Tyler Rinker Sep 08 '12 at 17:02
  • GSee: Hmm, I am having trouble formatting the text so it is easy to read. Sorry, but I am new here! Tyler Rinker: Thanks for the hint. Will look into that! – Kasper Christensen Sep 08 '12 at 17:25
  • Alright. I uploaded the first piece of code I would like to extract information from. I know it does not look neat, so if anyone knows how to do this, please teach me how to format the code so it does not appear as one long string. I have looked in the hints but they do not help me much. – Kasper Christensen Sep 08 '12 at 17:39
  • From the provided html file, can you just share what output *exactly* you would want and what format you would want it in? (`list`, `data.frame`, ...) – A5C1D2H2I1M1N2O1R2T1 Oct 19 '12 at 07:47

2 Answers


If that is your string, then you can use strsplit to get the material that follows each occurrence of the string 'A HREF="':

txt <- '</TABLE> <BR> <BR>  <FONT FACE="Verdana,Geneva,Helvetica" SIZE="-1" COLOR="#990000"><B>

    Message has 2 Replies: </B></FONT><BR>   <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR VALIGN=TOP BGCOLOR="#E0E0E0"><TD ALIGN=LEFT><A HREF="/dear-lego/?n=14"><IMG BORDER=5 HEIGHT=3 WIDTH=3 SRC="/news/x.gif"></A></TD><TD><FONT SIZE="-2">&nbsp;&nbsp;</FONT></TD><TD ALIGN=LEFT><FONT FACE="Verdana,Geneva,Helvetica" SIZE="-2"><A HREF="/dear-lego/?n=14">Re: Plate Paks</A><BR></FONT></TD><TD ALIGN=RIGHT><FONT FACE="Verdana,Geneva,Helvetica" SIZE="-2">&nbsp;Tom Stangl<BR></FONT></TD></TR><TR BGCOLOR="#F8F8F8"><TD COLSPAN=4 ALIGN=LEFT VALIGN=TOP><FONT FACE="Verdana,Geneva,Helvetica" SIZE="-2" '

This is the second fragment:

> strsplit(txt, split='A HREF="')[[1]][2]
[1] "/dear-lego/?n=14\"><IMG BORDER=5 HEIGHT=3 WIDTH=3 SRC=\"/news/x.gif\"></A></TD><TD><FONT SIZE=\"-2\">&nbsp;&nbsp;</FONT></TD><TD ALIGN=LEFT><FONT FACE=\"Verdana,Geneva,Helvetica\" SIZE=\"-2\"><"
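The fragment above still carries the markup that follows the link. As a minimal sketch in base R, assuming every link of interest is written as HREF="..." with double quotes, a regular expression can return only the link values. The helper name `extract_hrefs` and the shortened `snippet` string are illustrative and not part of the original post.

```r
# Hypothetical helper: return every HREF value found in an HTML string.
extract_hrefs <- function(html) {
  # Find all HREF="..." occurrences, ignoring case (HREF or href).
  hits <- regmatches(html, gregexpr('HREF="[^"]*"', html, ignore.case = TRUE))[[1]]
  # Keep only the part between the quotes.
  sub('^HREF="([^"]*)"$', "\\1", hits, ignore.case = TRUE)
}

# Shortened stand-in for the `txt` string above (illustrative only).
snippet <- paste0(
  '<TD><A HREF="/dear-lego/?n=14"><IMG SRC="/news/x.gif"></A></TD>',
  '<TD><A HREF="/dear-lego/?n=14">Re: Plate Paks</A></TD>'
)

# The same link appears twice in the markup, so drop duplicates.
links <- unique(extract_hrefs(snippet))
print(links)
```

Applied to a whole file, this would return every link on the page, including navigation links, so it probably makes sense to first cut the text down to the block of interest (for example with strsplit, as above) and only then extract the links.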

There are probably proper XML and HTML processing approaches, but they generally require an example with all the headers, and you have removed all of those.
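For what it is worth, a parser-based version might look roughly like the sketch below. It assumes the XML package is installed, and it assumes that the "Message is in Reply To:" and "Messages in This Thread:" labels sit inside B tags with the link in the next A tag. That layout is a guess modelled on the fragments in this thread, and the `html` string is made up for illustration, so the XPath expressions would need adjusting against a real file.

```r
library(XML)

# Made-up page fragment; the real markup around these labels is an assumption.
html <- paste0(
  '<HTML><BODY>',
  '<FONT><B>Message is in Reply To:</B></FONT><BR>',
  '<TABLE><TR><TD><A HREF="/dear-lego/?n=13">Plate Paks</A></TD></TR></TABLE>',
  '<FONT><B>3 Messages in This Thread:</B></FONT>',
  '</BODY></HTML>'
)

# The HTML parser lowercases tag names, so the XPath uses b and a.
doc <- htmlParse(html, asText = TRUE)

# Task 2: the first link that follows the "Message is in Reply To" label.
reply_to <- xpathSApply(
  doc,
  "//b[contains(., 'Message is in Reply To')]/following::a[1]",
  xmlGetAttr, "href"
)

# Task 1: the text of the thread label, then the leading number in it.
label <- xpathSApply(doc, "//b[contains(., 'in This Thread')]", xmlValue)
n_messages <- as.integer(sub("^[^0-9]*([0-9]+).*$", "\\1", label))

print(reply_to)
print(n_messages)
```

For files on disk, `htmlParse` can also be given a file path instead of a string, so the same steps could be wrapped in a function and run over the result of `list.files()`.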

IRTFM

You may want to look at the following question:

Is there a simple way in R to extract only the text elements of an HTML page?

I think it is the closest match to your question.

Community
Ali