-1

I'm looking to scrape a large number of files off the sec.gov website, and it's going well so far. The problem is that the old files are in a .txt kind of format and do not have any sort of real HTML formatting. Is there any way to get the information out of these files using Python?

Here's a link to an example document

I have about 30,000 of these guys to do, and the old documents are the ones that my boss really would like... I'm currently using BeautifulSoup4 for the other scrapes that are properly formatted.

Thanks in advance!

Retroflux
  • 1
    What kind of information are you trying to get? – Bubble Hacker May 24 '16 at 19:44
  • 2
    If they aren't HTML that's not web scraping, just plain parsing. – jonrsharpe May 24 '16 at 19:46
  • 1
    You need to add some expected output. – Padraic Cunningham May 24 '16 at 19:50
  • To parse text files you can use plain Python, such as string functions and regular expressions. I wrote a small library to help with this task, where you define what you want to extract as a model definition. It works well in cases like this where you have semi-structured data. Maybe it can help you: https://github.com/fgmacedo/raspador – Fernando Macedo May 24 '16 at 20:07
  • 1
    Yes, Python excels at text processing. Whether the documents contain sufficient logical information to allow you to extract the data you want is something we can't tell from the question. – Larry Lustig May 24 '16 at 20:11

2 Answers

4

If you are able to get the text files, you should just need basic text file parsing.

Something like this should be fine for your purposes: http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python

Specifically, to open a file that you have locally you can use something like this:

file = open("newfile.txt", "r")

Here the first argument is the name of your file and the second is the mode to open it in ("r" stands for read). You can then use methods such as file.read(), file.readline(), or file.readlines() to get the file's contents as one string, a single line, or a list of lines.
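
As a rough illustration of how those three methods differ, here is a small sketch. It writes its own throwaway file first so it can run anywhere; the file contents are made up for the example.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# The contents are invented purely for illustration.
path = os.path.join(tempfile.mkdtemp(), "newfile.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

# read() returns the entire file as one string.
with open(path, "r") as f:
    whole_text = f.read()

# readline() returns one line (including its trailing newline);
# readlines() returns a list of whatever lines have not been read yet.
with open(path, "r") as f:
    first_line = f.readline()
    remaining = f.readlines()
```

Using `with` closes the file automatically, which matters when you loop over thousands of documents.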

If you want to read words from the text file specifically, check out Reading a text file and splitting it into single words in python as well. The answer there shows you how to iterate through all the words in a text file that is located in the same directory as your python script.

with open('words.txt','r') as f:
    for line in f:
        for word in line.split():
            print(word)

If you don't have the file locally downloaded but you have the URL, this should also help you out: In Python, given a URL to a text file, what is the simplest way to read the contents of the text file?

The specific part in that link you are looking for is this:

import urllib.request  # handles the URL fetching (this module was urllib2 on Python 2)

# target_url is the URL of the text file you want to read
data = urllib.request.urlopen(target_url)  # a file-like object; read it much like a file
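
A minimal sketch of that approach as a reusable helper, assuming Python 3 (where the module is urllib.request) and that the files are UTF-8 text. The name read_text_from_url and the default encoding are assumptions for illustration, not something from the linked answer.

```python
import urllib.request


def read_text_from_url(url, encoding="utf-8"):
    """Fetch a URL and return its body decoded as text.

    The encoding is an assumption; undecodable bytes are replaced
    rather than raising, which may or may not be what you want.
    """
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")
```

You could then call `read_text_from_url(target_url).splitlines()` to get a list of lines and process each document the same way you would a local file.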
Kush
  • Aside from the question not having enough information, why vote down? This is exactly what he is asking to do... – Kush May 24 '16 at 19:57
  • 2
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/12461934) – pppery May 24 '16 at 20:38
  • Fair enough! Thanks for the info ppperry. – Kush May 24 '16 at 20:44
  • Considering how little information I gave, I think his answer is more than acceptable. My question was fairly basic, and he gave a more than basic answer. For whatever reason, I forgot that no matter where it is, it's still a text file. Thanks for the advice, I'll check urllib today! – Retroflux May 25 '16 at 11:31
0

For this specific example, you can use urllib.request to GET the file and lxml to parse it:

import urllib.request
from io import StringIO

from lxml import etree

url = 'https://www.sec.gov/Archives/edgar/data/20/000089322004000596/w93059exv31w1.txt'
broken_xml = urllib.request.urlopen(url).read().decode('utf-8')

# recover=True lets lxml build a tree from markup that is not well-formed XML
parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(broken_xml), parser=parser)

tree.xpath('//SEQUENCE/text()')
# ['7\n']
tree.xpath('//FILENAME/text()')
# ['w93059exv31w1.txt\n']
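
If lxml is not available, a plain regular expression may be enough for the header fields. The sketch below assumes each field sits on its own line in the form `<TAG>value`, as the SEQUENCE and FILENAME results above suggest. The sample text is hypothetical (only the SEQUENCE and FILENAME values come from the output above), and header_fields is a made-up helper name; real filings may not follow this shape exactly.

```python
import re

# Hypothetical sample shaped like the header of the example filing.
# Only the SEQUENCE and FILENAME values are taken from the output above;
# everything else is invented for illustration.
sample = """<DOCUMENT>
<TYPE>EX-31.1
<SEQUENCE>7
<FILENAME>w93059exv31w1.txt
<TEXT>
Body text of the exhibit goes here.
</TEXT>
</DOCUMENT>
"""

# Match lines of the form <TAG>value where the value is non-empty.
# Bare tags such as <TEXT> and closing tags such as </TEXT> do not match.
TAG_LINE = re.compile(r"^<([A-Z][A-Z0-9-]*)>(.+)$", re.MULTILINE)


def header_fields(text):
    """Return a dict mapping TAG -> value for every '<TAG>value' line."""
    return {tag: value.strip() for tag, value in TAG_LINE.findall(text)}


fields = header_fields(sample)
```

This trades robustness for simplicity: it ignores nesting entirely, so it only suits documents where each field of interest appears once.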
Tony DiFranco