2

I'm trying to extract the XML portion of a .txt file in Python. The file I'm using comes from the EDGAR database and holds multiple representations of a 10-K report in a single file: HTML first, then XML, and then other representations such as PDF.

If anyone knows a way to extract this XML so I can use its tags, I'd greatly appreciate it.

Here's an example of the txt file I'm talking about: https://www.sec.gov/Archives/edgar/data/51143/000005114313000007/0000051143-13-000007.txt

segfault
  • You are going to have to provide a bit more detail here. Examples, etc. – OldProgrammer Apr 28 '20 at 02:12
  • Sure, here's an example of the txt file I'm dealing with. As you can see, there is XML inside that I need to extract. https://www.sec.gov/Archives/edgar/data/51143/000005114313000007/0000051143-13-000007.txt – segfault Apr 28 '20 at 02:14
  • This is a very unfortunate file design, because you can't find the end of an XML document reliably except by using an XML parser, and an XML parser expects to find the end of file immediately after the end of the document. Change the design if you can. – Michael Kay Apr 28 '20 at 06:46
  • Are you talking about the exhibits formatted as XBRL documents and embedded in the file, for example "EX-101.INS"? There are several of those. There are other segments that are identified as "XML" but are really HTML. Just a heads up: working with EDGAR filings is a major PITA... – Jack Fleeting May 01 '20 at 11:39

2 Answers

1

You can try using:

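import re
import requests

url = "https://www.sec.gov/Archives/edgar/data/51143/000005114313000007/0000051143-13-000007.txt"
# The SEC asks automated clients to identify themselves; the value below is a placeholder.
text = requests.get(url, headers={"User-Agent": "Your Name your.email@example.com"}).text

# Case-sensitive on purpose: with re.IGNORECASE, </XBRL> would also match a lowercase </xbrl> root tag.
# The <DESCRIPTION> line is optional and may contain spaces; the dot in ".xml" is escaped.
pattern = r"<FILENAME>(\S+\.xml)\s+(?:<DESCRIPTION>[^\n]*\s+)?<TEXT>\s+<XBRL>(.*?)</XBRL>"
for xml in re.finditer(pattern, text, re.DOTALL):
    xml_filename = xml.group(1)
    xml_content = xml.group(2).strip()  # no whitespace before the XML declaration
    with open(xml_filename, "w", encoding="utf-8") as w:
        w.write(xml_content)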
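Once a block has been captured, its tags can be read with an XML parser. The following is only a minimal sketch, assuming the standard-library xml.etree.ElementTree is acceptable for the job. The SAMPLE text, its element names, and the helpers local_name and extract_facts are invented for illustration and are not taken from the real filing, whose layout may differ.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature of an EDGAR submission, invented for illustration.
SAMPLE = """<DOCUMENT>
<TYPE>EX-101.INS
<SEQUENCE>1
<FILENAME>sample-20121231.xml
<DESCRIPTION>XBRL INSTANCE DOCUMENT
<TEXT>
<XBRL>
<?xml version="1.0" encoding="UTF-8"?>
<xbrl xmlns="http://www.xbrl.org/2003/instance" xmlns:ex="http://example.com/ex">
  <ex:Revenues contextRef="FY2012" unitRef="USD">100</ex:Revenues>
  <ex:Assets contextRef="FY2012" unitRef="USD">250</ex:Assets>
</xbrl>
</XBRL>
</TEXT>
</DOCUMENT>
"""

# Case-sensitive, so the uppercase wrapper </XBRL> is not confused with </xbrl>.
PATTERN = re.compile(
    r"<FILENAME>(\S+\.xml)\s+(?:<DESCRIPTION>[^\n]*\s+)?<TEXT>\s+<XBRL>(.*?)</XBRL>",
    re.DOTALL,
)


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def extract_facts(text):
    """Return {filename: [(tag, text), ...]} for each wrapped XML document."""
    result = {}
    for match in PATTERN.finditer(text):
        # Strip surrounding whitespace: the XML declaration must come first.
        root = ET.fromstring(match.group(2).strip().encode("utf-8"))
        result[match.group(1)] = [(local_name(el.tag), el.text) for el in root]
    return result


facts = extract_facts(SAMPLE)
for filename, items in facts.items():
    print(filename, items)
```

On a real filing, the same function could be called on the downloaded text instead of SAMPLE. Whether every embedded document parses cleanly is not guaranteed, so wrapping the ET.fromstring call in a try/except for ET.ParseError may be worthwhile.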


Pedro Lobito
  • Thank you for your detailed answer. I couldn't get this to work but when I changed the regex to '([^\s]+.xml)\s(.*?)(.*?)' it worked. xml_content was now group(3) i.e. xml_content = xml.group(3) – Peter H Mar 04 '21 at 15:39
  • They may have changed their code in the meantime. Glad you figured it out. – Pedro Lobito Mar 04 '21 at 16:34
0

How about this?

def getData(xml):
    # Process one extracted XML document here.
    pass

# Path to the downloaded EDGAR submission file.
with open('0000051143-13-000007.txt', 'r') as file:
    lines = []
    flag = False  # True while inside an XML document
    for line in file:
        if '</XBRL>' in line:
            # End of the XBRL wrapper: hand over what was collected, if anything.
            if lines:
                getData("".join(lines))
            flag = False
            lines = []
        if flag or '<?xml ' in line:
            flag = True
            lines.append(line)
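The same line-by-line scan can also be wrapped in a generator, so the caller decides what to do with each document. This is a sketch only: the name iter_xml_blocks and the two-document sample are hypothetical, and it assumes, as the loop above does, that each XML document starts on a line containing '<?xml ' and ends just before a line containing '</XBRL>'.

```python
import io


def iter_xml_blocks(lines):
    """Yield each XML document found between '<?xml ' and '</XBRL>'."""
    block = []
    inside = False
    for line in lines:
        if "</XBRL>" in line:
            # Wrapper closed: emit the collected document, if any.
            if block:
                yield "".join(block)
            block, inside = [], False
        elif inside or "<?xml " in line:
            inside = True
            block.append(line)


# Hypothetical two-document sample standing in for the real file.
sample = io.StringIO(
    '<TEXT>\n<XBRL>\n<?xml version="1.0"?>\n<a>1</a>\n</XBRL>\n</TEXT>\n'
    '<TEXT>\n<XBRL>\n<?xml version="1.0"?>\n<b>2</b>\n</XBRL>\n</TEXT>\n'
)
blocks = list(iter_xml_blocks(sample))
print(len(blocks))
```

With a real file, an open file object can be passed in place of the StringIO, which keeps memory use low because only one document is held at a time.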
dabingsou