If you're working with pseudo-XML, it's probably best to define the parsing rules yourself. I like stringr and dplyr for stuff like this.
Here's a two-element vector (instead of 312 in your case):
vec <- c(
  "<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >",
  "<severity='5', hostname='computername126', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='235' >"
)
Convert it to a data.frame object:
df <- data.frame(vec, stringsAsFactors = FALSE)
Then extract each value by character position, using the start positions of the field names as reference points:
library(stringr)
library(dplyr)
df %>%
  mutate(
    # Start position of each field name, plus the position of the closing ">"
    severityStr = str_locate(vec, "severity")[, "start"],
    hostnameStr = str_locate(vec, "hostname")[, "start"],
    sourceStr = str_locate(vec, "source")[, "start"],
    moduleStr = str_locate(vec, "module")[, "start"],
    processStr = str_locate(vec, "process")[, "start"],
    pidStr = str_locate(vec, "pid")[, "start"],
    endStr = str_locate(vec, ">")[, "start"],
    # Each value starts right after name=' and ends just before the "', " that precedes the next field name
    severity = substr(vec, severityStr + 10, hostnameStr - 4),
    hostname = substr(vec, hostnameStr + 10, sourceStr - 4),
    source = substr(vec, sourceStr + 8, moduleStr - 4),
    module = substr(vec, moduleStr + 8, processStr - 4),
    process = substr(vec, processStr + 9, pidStr - 4),
    pid = substr(vec, pidStr + 5, endStr - 3)
  ) %>%
  select(severity, hostname, source, module, process, pid)
Here's the resulting data frame:
  severity        hostname          source       module     process pid
1        4 computername125 PackageDownload herpderp.dll masterP.exe 234
2        5 computername126 PackageDownload herpderp.dll masterP.exe 235
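If the +10 and -4 offsets look like magic numbers, here is a minimal sketch of where they come from, worked through for the hostname field only. The variable names (x, host_start, src_start, value_start, value_end) are just for illustration and are not part of the pipeline.

```r
library(stringr)

x <- "<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >"

# Start positions of the field name and of the field name that follows it
host_start <- str_locate(x, "hostname")[, "start"]
src_start  <- str_locate(x, "source")[, "start"]

# Skip over the field name, the "=" and the opening quote: hostname=' is 10 characters
value_start <- host_start + nchar("hostname='")

# Step back over the closing quote, comma and space ("', " is 3 characters),
# then one more to land on the last character of the value: 3 + 1 = 4
value_end <- src_start - nchar("', ") - 1

hostname <- substr(x, value_start, value_end)
hostname
```

The other fields follow the same pattern. Only the length of the name=' prefix changes (8 for source and module, 9 for process, 5 for pid), and pid uses -3 because it is followed by "' >" rather than "', ".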
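This approach handles values of different lengths, because every start and end position is computed relative to the field names rather than hard-coded. For example, it would read pid in correctly even if it's 95 (two digits instead of three). It does assume that the fields always appear in the same order, that they are separated by a quote, comma and space, and that no field name also occurs inside an earlier value.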
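A quick way to convince yourself of this is to run the pid extraction on a made-up record with a two-digit pid. This is only a sketch: the second string below is hypothetical test data, not something from your log.

```r
library(stringr)

x <- c(
  "<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >",
  # Hypothetical record with a two-digit pid
  "<severity='5', hostname='computername126', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='95' >"
)

# Positions are located per string, so a shorter value simply moves the end position
pid_start <- str_locate(x, "pid")[, "start"]
end_pos   <- str_locate(x, ">")[, "start"]

# Same offsets as in the pipeline: skip pid=' (5 characters), stop before "' >"
pid <- substr(x, pid_start + 5, end_pos - 3)
pid
```

Both values come back whole, since substr() is vectorised over the start and stop positions.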