I am using the xml2 package in R to extract nodes that share the same class name. I want the start and end dates (both have class 'date') that appear after the 'role' and 'company' tags in the XML. There are other date tags associated with trainings, which I don't need. The format also varies from one XML file to another. Is there a function that lets me select only the date tags that follow each role tag? Below is the XML snippet:

<span class="work-hist-mark" id="57" inprof="n">CAREER HISTORY:</span>
No Company Position Years * 
<span class="company" id="58" inprof="y">Nasioncom</span>
<span class="role" id="59_1" inprof="y">Helpdesk</span>
1st level 
<span class="date" id="60_1" inprof="y">Jan 1999</span>
- 
<span class="date" id="60_2" inprof="y">June 2000</span>
* 
<span class="role" id="61_1_1" inprof="y">Komputer Sistem System Engineer</span>
<span class="date" id="61_2_1" inprof="y">June 2000</span>
- 
<span class="date" id="61_2_2" inprof="y">Oct 2003</span>
* 
<span class="role" id="62_1_1" inprof="y">Servicesoft Network Engineer</span>
<span class="date" id="62_2_1" inprof="y">Oct 2003</span>
- 
<span class="date" id="62_2_2" inprof="y">June 2006</span>
* 
<span class="company" id="63_1" inprof="y">EDS</span>
<span class="role" id="63_2_1" inprof="y">Infrastructure Associate</span>
<span class="date" id="63_3_1" inprof="y">July</span>
- 
<span class="date" id="63_3_2" inprof="y">Nov 2006</span>
* 
<span class="company" id="64_1" inprof="y">Atos Origin</span>
<span class="role" id="64_2_1" inprof="y">Technical Specialist</span>
<span class="date" id="64_3_1" inprof="y">Nov 2006</span>
- 
<span class="date" id="64_3_2" inprof="y">Nov 2008</span>
* 
<span class="company" id="65" inprof="y">Hewlett Packard</span>
<span class="role" id="66_1" inprof="y">Wintel Server Specialist</span>
Level 3 
<span class="date" id="67_1" inprof="y">Nov 2008</span>
to 
<span class="date" id="67_2" inprof="y">present</span>
TRAINING ATTENDED: 
<span class="date" id="68" inprof="y">2001</span>
<span class="sofwr" id="69" inprof="y">HP</span>
& 
<span class="sofwr" id="70" inprof="y">Compaq Proliant server</span>
series 
<span class="date" id="71_1_1" inprof="y">2003</span>
/
<span class="date" id="71_1_2" inprof="y">05</span>
<span class="role" id="71_2_1" inprof="y">Sophos Antivirus Technical Consultant</span>
<span class="company" id="71_3" inprof="y">Mail Monitor SMTP</span>
<span class="location" id="71_4" inprof="y">Pure</span>
Message for 
<span class="sofwr" id="72" inprof="y">Exchange</span>
or 
<span class="sofwr" id="73" inprof="y">UNIX</span>
(antivirus + antispam) SAV Integrated (http web scanning) Remote Update (design for mobile user) Sophos in multiple platforms (open source eg: 
<span class="sofwr" id="74" inprof="y">UNIX</span>
, 
<span class="sofwr" id="75" inprof="y">Linux</span>
, 
<span class="sofwr" id="76" inprof="y">Mac9 &10</span>
, 
<span class="sofwr" id="77" inprof="y">FreeBSD</span>
) 
<span class="company" id="78" inprof="n">Small Business Enterprise</span>
<span class="date" id="79" inprof="y">2005</span>
Watchguard X500/ X2500 Add-on: 
<span class="company" id="80" inprof="y">GatewayAV, Weblocker & Spam</span>
screen 
<span class="date" id="81" inprof="n">2007</span>
<span class="sofwr" id="82" inprof="y">Microsoft Windows Vista</span>
Install, configuring and managing 
<span class="sofwr" id="83" inprof="y">Windows Vista</span>
Vishnu

1 Answer

This is interesting because the data is dirty: some dates are just a year (2001), some are an abbreviated month plus a year (Jan 1999), and others are a full month name plus a year (June 2000).

I am unsure how you will choose to handle the dirty data, but for the conversion itself you're looking for the readr package, specifically the parse_date function.

Here's an example. Let's say I have the string "Jan foo 05, 2016 bar" and I want to extract a Date object from it.

library(readr)
df1 <- "Jan foo 05, 2016 bar"
parse_date(df1, "%b foo %d, %Y bar")

[1] "2016-01-05"

You'll need to take the same approach. I would suggest storing each line as an observation, then filtering your observations down to only those that contain dates. From there you can apply parse_date as I have done above. Because your dates are formatted differently, you'll need a function, an if/else, or some other handler to accommodate the different formats.
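One way such a handler might look is sketched below. It tries a list of formats in order and keeps the first one that parses. The helper name parse_mixed_date and the choice of formats are assumptions made for illustration, based only on the date strings visible in the question. Values that match none of the formats (such as "present") are left as NA.

```r
library(readr)

# Hypothetical helper: try each format in turn, filling in only the
# values that are still unparsed. The format list is an assumption.
parse_mixed_date <- function(x, formats = c("%b %Y", "%B %Y", "%Y")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)
    if (!any(todo)) break
    # parse_date() warns on strings that do not match; silence that here
    # because a failure just means "try the next format".
    out[todo] <- suppressWarnings(parse_date(x[todo], format = fmt))
  }
  out
}

parsed <- parse_mixed_date(c("Jan 1999", "June 2000", "2001", "present"))
parsed
```

Dates without a day (or month) component come back as the first day of the month (or year), so they may need to be interpreted accordingly.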

For the filtering component you could use the filter function from dplyr, using the method mentioned on this thread.
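Alternatively, the selection could probably be done in xml2 itself with XPath, before any parsing. The sketch below is only an illustration under some assumptions: the fragment is a shortened, hand-made copy of the question's data wrapped in a made-up root element <cv> so that it is well-formed, and it assumes the start and end dates are among the first two span siblings after each role. The real snippet contains bare "&" characters, so it would likely need cleaning (or a more lenient parser such as read_html) first.

```r
library(xml2)

# Shortened, well-formed stand-in for the question's data (assumption).
doc <- read_xml('<cv>
  <span class="company" id="58">Nasioncom</span>
  <span class="role" id="59_1">Helpdesk</span>
  1st level
  <span class="date" id="60_1">Jan 1999</span>
  -
  <span class="date" id="60_2">June 2000</span>
  TRAINING ATTENDED:
  <span class="date" id="68">2001</span>
  <span class="sofwr" id="69">HP</span>
  <span class="date" id="71_1_1">2003</span>
  <span class="role" id="71_2_1">Sophos Antivirus Technical Consultant</span>
  <span class="company" id="71_3">Mail Monitor SMTP</span>
</cv>')

roles <- xml_find_all(doc, "//span[@class='role']")

# For each role, look at the next two span siblings and keep those
# whose class is "date". Training dates fall outside that window.
dates_after <- lapply(roles, function(role) {
  xml_text(xml_find_all(
    role, "following-sibling::span[position() <= 2][@class='date']"
  ))
})
names(dates_after) <- xml_text(roles)
dates_after
```

The two-span window is what keeps the training dates out: a plain following-sibling search for class 'date' would also pick up every date later in the document.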

Make sense? Good luck!

Joel Alcedo