1

I'm having difficulty getting a XML file into R as a data frame. The data I working with is free to download from The Securities and Exchange Commission and can be found HERE.

The first downloadable option SEC Investment Advisor Report (37 MB) is the one I'll use as an example. After I download the file, I unzip it in my terminal (mac) with the following code: gunzip -d IA_FIRM_SEC_Feed_02_05_2017.xml.gz. This unzips into XML format. And from here, I've been struggling to get this into an R data frame. I've tried different variations of code using the XML package, but can only get one variable or two variables into the data frame. Here is an example of what the data looks like when I open the XML file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<IAPDFirmSECReport GenOn="2017-02-05">
<Firms>
<Firm>
<Info SECRgnCD="DRO" FirmCrdNb="116085" SECNb="801-63010" BusNm="JACOBSEN CAPITAL MANAGEMENT" LegalNm="JACOBSEN CAPITAL MANAGEMENT, LLC"/>
<MainAddr Strt1="280 S 400 W" Strt2="SUITE 220" City="SALT LAKE CITY" State="UT" Cntry="United States" PostlCd="84101" PhNb="801.746.7171" FaxNb="801.363.5886"/>
<MailingAddr/>
<Rgstn FirmType="Registered" St="APPROVED" Dt="2004-04-07"/>
<NoticeFiled>
<States RgltrCd="AZ" St="FILED" Dt="2009-12-10"/>
<States RgltrCd="UT" St="FILED" Dt="2009-12-10"/>
</NoticeFiled>
<Filing Dt="2016-11-09" FormVrsn="10/2012"/>
<FormInfo>
<Part1A>
<Item1 Q1I="Y" Q1M="N" Q1N="N" Q1O="N">
<WebAddrs>
<WebAddr>HTTP://WWW.JACOBSENCAPITAL.COM</WebAddr>
</WebAddrs>
</Item1>
<Item2A Q2A1="Y" Q2A2="N" Q2A3="N" Q2A4="N" Q2A5="N" Q2A6="N" Q2A7="N" Q2A8="N" Q2A9="N" Q2A10="N" Q2A11="N" Q2A12="N" Q2A13="N"/>
<Item2B/>
<Item3A OrgFormNm="Limited Liability Company"/>
<Item3B Q3B="NOVEMBER"/>
<Item3C StateCD="UT" CntryNm="United States"/>
<Item5A TtlEmp="4"/>
<Item5B Q5B1="2" Q5B2="0" Q5B3="4" Q5B4="0" Q5B5="0" Q5B6="0"/>
<Item5C Q5C1="26-100" Q5C2="0"/>
<Item5D Q5D1A="Up to 10 percent" Q5D1B="76-99 percent" Q5D1C="0 percent" Q5D1D="0 percent" Q5D1E="0 percent" Q5D1F="0 percent" Q5D1G="Up to 10 percent" Q5D1H="Up to 10 percent" Q5D1I="0 percent" Q5D1J="0 percent" Q5D1K="0 percent" Q5D1L="0 percent" Q5D1M="0 percent" Q5D2A="Up to 25 percent" Q5D2B="More than 75 percent" Q5D2C="0 percent" Q5D2D="0 percent" Q5D2E="0 percent" Q5D2F="0 percent" Q5D2G="Up to 25 percent" Q5D2H="Up to 25 percent" Q5D2I="0 percent" Q5D2J="0 percent" Q5D2K="0 percent" Q5D2L="0 percent" Q5D2M="0 percent"/>
<Item5E Q5E1="Y" Q5E2="Y" Q5E3="N" Q5E4="Y" Q5E5="N" Q5E6="N" Q5E7="N"/>
<Item5F Q5F1="Y" Q5F2A="190000000" Q5F2B="0" Q5F2C="190000000" Q5F2D="342" Q5F2E="0" Q5F2F="342"/>
<Item5G Q5G1="Y" Q5G2="Y" Q5G3="N" Q5G4="N" Q5G5="N" Q5G6="Y" Q5G7="Y" Q5G8="N" Q5G9="N" Q5G10="N" Q5G11="N" Q5G12="N"/>
<Item5H Q5H="26-50"/>
<Item5I Q5I1="N" Q5I2="N"/>
<Item5J Q5J="N"/>
<Item6A/>
<Item6B Q6B1="N" Q6B3="N"/>
<Item7A Q7A1="N" Q7A2="N" Q7A3="N" Q7A4="N" Q7A5="N" Q7A6="N" Q7A7="N" Q7A8="N" Q7A9="N" Q7A10="Y" Q7A11="N" Q7A12="N" Q7A13="N" Q7A14="N" Q7A15="N" Q7A16="N"/>
<Item7B Q7B="N"/>
<Item8A Q8A1="N" Q8A2="Y" Q8A3="N"/>
<Item8B Q8B1="N" Q8B2="N" Q8B3="N"/>
<Item8C Q8C1="Y" Q8C2="Y" Q8C3="N" Q8C4="N"/>
<Item8D/>
<Item8E Q8E="N"/>
<Item8F/>
<Item8G Q8G1="N"/>
<Item8H Q8H="N"/>
<Item8I Q8I="N"/>
<Item9A Q9A1A="N" Q9A1B="N"/>
<Item9B Q9B1A="N" Q9B1B="N"/>
<Item9C Q9C1="N" Q9C2="N" Q9C3="N" Q9C4="N"/>
<Item9D Q9D1="N" Q9D2="N"/>
<Item9E/>
<Item9F Q9F="1"/>
<Item10A Q10A="N"/>
<Item11 Q11="N"/>
<Item11A Q11A1="N" Q11A2="N"/>
<Item11B Q11B1="N" Q11B2="N"/>
<Item11C Q11C1="N" Q11C2="N" Q11C3="N" Q11C4="N" Q11C5="N"/>
<Item11D Q11D1="N" Q11D2="N" Q11D3="N" Q11D4="N" Q11D5="N"/>
<Item11E Q11E1="N" Q11E2="N" Q11E3="N" Q11E4="N"/>
<Item11F Q11F="N"/>
<Item11G Q11G="N"/>
<Item11H Q11H1A="N" Q11H1B="N" Q11H1C="N" Q11H2="N"/>
</Part1A>
</FormInfo>
</Firm>
<Firm>

It looks like the variables are listed directly before the equals sign and the content(row item) is listed directly after the equals sign. For example in the line: <Info SECRgnCD="DRO" FirmCrdNb="116085" SECNb="801-63010" BusNm="JACOBSEN CAPITAL MANAGEMENT" LegalNm="JACOBSEN CAPITAL MANAGEMENT, LLC"/>

I would want the data frame to look as follows:

SECRgnCD FirmCrdNB  SECNb      BusNm
DRO       116085    801-63010  JACOBSEN CAPITAL MANAGEMENT

Thanks for your help.

richiepop2
  • 348
  • 1
  • 12
  • XML is a structured data format, with nested data structures, attributes on elements, and namespaces. There's no magic bullet that can turn any XML into a data frame. What elements do you want to make into rows of your data frame, and which parts of those elements do you want as columns? – Spacedman Feb 05 '17 at 20:34
  • Duplicate: http://stackoverflow.com/questions/21648980/transform-xml-into-a-data-frame?rq=1 – Spacedman Feb 05 '17 at 20:35
  • @richiepop2, can you show us one of your code tries? – Derek Corcoran Feb 05 '17 at 20:41
  • @Spacedman I updated the post to reflect what I'd like in rows and columns. Thank you. – richiepop2 Feb 05 '17 at 21:00

2 Answers2

2

The following code should work;

 require(XML)
 data <- xmlParse('./IA_FIRM_SEC_Feed_02_05_2017.xml')
 result <- as.data.frame(t(xmlSApply(data["//Firms/Firm/Info"],xmlAttrs)),
                    stringsAsFactors=FALSE)
Mathew Savage
  • 168
  • 11
  • thanks @Mathew Savage, that works perfectly! Curious, how might I change that code to capture all of the variables in between the firm brackets(see updated code above)? This is what I was hoping the code would achieve, but realize I was not crystal clear in my description. Thanks so much! – richiepop2 Feb 06 '17 at 00:30
1

Consider even using XML's internal variable, xmlAttrsToDataFrame() using the triple colon operator:

library(XML)

df <- XML:::xmlAttrsToDataFrame(getNodeSet(doc, path='//Info'))
# OR df <- XML:::xmlAttrsToDataFrame(doc['//Info'])

df
#   SECRgnCD FirmCrdNb     SECNb                       BusNm                          LegalNm
# 1      DRO    116085 801-63010 JACOBSEN CAPITAL MANAGEMENT JACOBSEN CAPITAL MANAGEMENT, LLC

And per OP's update to capture all attributes under <Firm>, use above method in a lapply loop. Below trycatch is used to capture elements without attributes that raises a warning:

# CAPTURE ALL ELEMENT NAMES UNDER <Firm> AND <Part1A>
firmsnames <- append(xpathSApply(doc, "//Firm/*", xmlName),
                     xpathSApply(doc, "//Part1A/*", xmlName))

# ITERATE THROUGH EACH ELEMENT BINDING ATTRS TO DATAFRAME    
dfList <- lapply(firmsnames, function(n) {
  tryCatch({
    return(XML:::xmlAttrsToDataFrame(doc[paste0('//', n)]))
  }, warning=function(w) return(data.frame(NA)))
})

# FILTER OUT EMPTY DF ELEMENTS IN LIST
dfList <- Filter(function(i) length(i)>0, dfList) 

# COLUMN BIND ALL LIST ELEMENTS
df <- do.call(cbind, dfList)

# RETURNS A DATA FRAME OF 1 ROW AND 182 COLUMNS 
str(df)
# data.frame':  1 obs. of  182 variables:
#  $ SECRgnCD : Factor w/ 1 level "DRO": 1
#   ..- attr(*, "names")= chr "SECRgnCD"
#  $ FirmCrdNb: Factor w/ 1 level "116085": 1
#   ..- attr(*, "names")= chr "FirmCrdNb"
#  $ SECNb    : Factor w/ 1 level "801-63010": 1
#   ..- attr(*, "names")= chr "SECNb"
#  $ BusNm    : Factor w/ 1 level "JACOBSEN CAPITAL MANAGEMENT": 1
#   ..- attr(*, "names")= chr "BusNm"
#  $ LegalNm  : Factor w/ 1 level "JACOBSEN CAPITAL MANAGEMENT, LLC": 1
#   ..- attr(*, "names")= chr "LegalNm"
# ...
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • thanks @Parfait! This solution works well, too. Quick question. When you say: `And per OP's update to capture all attributes under , use above method in a lapply loop.`, could you show me the code on how to do that? There are approximately 60 variables and about 16k observations in this table. – richiepop2 Feb 06 '17 at 02:58
  • I'm not understanding. This is the code to do that. Above method meant `xmlAttrsToDataFrame`. Run this solution with your posted XML to see. – Parfait Feb 06 '17 at 15:37