1

I am a new programmer and we are working on a Graduate English project where we are trying to parse a gigantic dictionary text file (500 MB). The file is set up with HTML-like tags. I have 179 author tags, e.g. "[A>]Shakes.[/A]" for Shakespeare, and what I need to do is find each occurrence of every tag and then write that tag and what follows on the line until I get to "[/W]".

My problem is that readlines() gives me a memory error (I am assuming because the file is so large). I have been able to find a match, but only the first one, and I have not been able to get the search to look past it. Any help that anyone could give would be greatly appreciated.

There are no newlines in the text file, which I think is causing the problem. This problem has been solved. I thought I would include the code that worked:

with open('/Users/Desktop/Poetrylist.txt', 'w') as output_file:
    with open('/Users/Desktop/2e.txt', 'r') as open_file:
        the_whole_file = open_file.read()
        start_position = 0
        while True:
            start_position = the_whole_file.find('<A>', start_position)
            if start_position < 0:
                break
            start_position += 3  # skip past '<A>'
            end_position = the_whole_file.find('</W>', start_position)
            if end_position < 0:
                break  # no closing tag left: stop instead of looping forever
            output_file.write(the_whole_file[start_position:end_position])
            output_file.write("\n")
            start_position = end_position + 4  # skip past '</W>'
English Grad
  • [This answer](http://stackoverflow.com/questions/3893885/cheap-way-to-search-a-large-text-file-for-a-string/3893931#3893931) might help. – Paolo Jul 22 '11 at 13:52
  • If you want help with how to match etc. please show us code with what you're doing now. – agf Jul 22 '11 at 14:04
  • You might also be interested in this: http://www.nltk.org/book. It covers natural language processing with python, as well as serving as a gentle introduction to the python language (assumption based on your graduate english major status). – chris Jul 22 '11 at 14:50
  • @English Grad Are all the tags of the same type, that is to say beginning with **[A>]** and ending with **[/A]** ? – eyquem Jul 22 '11 at 15:43
  • @English Grad Are there newlines in the text , that is to say characters **\r** or **\n** or **\r\n** ? – eyquem Jul 22 '11 at 15:46
  • @English Grad What is **[/W]** ? Is it the end of an element whose beginning would be **[W>]** ? – eyquem Jul 22 '11 at 15:47
  • @English Grad A file containing normal text, that is to say one in which there are newlines, can be read by iteration like this: ``handle = open('filename.txt','r')`` and ``for line in handle:`` If it isn't such a text, we must know what other kind of structure it has, that is to say how the portions of information are delimited, each portion concerning one author. Is an information portion delimited by **[W>]** and **[/W]** ? – eyquem Jul 22 '11 at 15:58
  • @English Grad After knowing that, it will be possible to iterate in the file differently than with the above iteration ``for line in handle``; we will use a **while** loop and the 'read(n)' method where **n** is the number of bytes we want to read in one time. – eyquem Jul 22 '11 at 15:59
  • Hi there, thanks for your help. The file is a giant text file that does not have any new lines in the text. All of the tags open with "" and end with "". – English Grad Jul 22 '11 at 17:19
  • eyquem: This is a sample of code that I am trying to Parse: AAA e&mac.ie&shti., the first letter of the Roman Alphabet, and of its various subsequent modifications (as were its prototypes Alpha of the Greek, and Aleph of the Ph&oe.nician and old Hebrew); I am trying to pull out the tags ... and .... They usually sit next to each other in the file. The whole text file is 546 MB – English Grad Jul 22 '11 at 17:50
  • @English Grad Please, give a sufficient portion of the file in order that someone can really work on the problem. There are no tags ... and ... in the sample you gave. I'd like to see the structure text with precision to work on it. You can send it to my email eyguem@gmail.com (eyguem with **g**, not eyquem) – eyquem Jul 22 '11 at 21:31

6 Answers

3

After opening the file, iterate through the lines like this:

with open('huge_file.txt', 'r') as input_file:
    for input_line in input_file:
        # process the line however you need - consider learning some basic regular expressions
        pass

This lets you process the file by reading it in line by line as needed, rather than loading it all into memory at once. Note that it only helps if the file actually contains line breaks; a file with no newlines is read as one enormous line.
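As a rough illustration of what "process the line" could look like with a regular expression, here is a minimal sketch. The tag spellings are taken from the question, and the helper name `entries_in_line` and the sample string are made up for the example.

```python
import re

# Tag spellings follow the question's example; adjust them to the real file.
# The brackets are escaped because [ and ] are special in regular expressions.
ENTRY = re.compile(r'\[A>\](.*?)\[/W\]')

def entries_in_line(line):
    """Return the text between each [A>] tag and the next [/W] tag."""
    # The non-greedy .*? stops at the first [/W] after each [A>].
    return ENTRY.findall(line)

# Hypothetical sample line with two entries.
sample = 'x [A>]Shakes.[/A] Haml. [/W] y [A>]Milton[/A] P.L. [/W] z'
for entry in entries_in_line(sample):
    print(entry)
```

Because `findall` returns every non-overlapping match, this also picks up a second or third entry on the same line.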

dtanders
2

I don't know regular expressions well, but you can solve this problem without them, using the string method find() and line slicing.

answer = ''

with open('yourFile.txt', 'r') as open_file, open('output_file', 'w') as output_file:
    for each_line in open_file:
        # find() returns -1 when the tag is absent, so compare explicitly
        start_position = each_line.find('[A>]')
        if start_position != -1:
            start_position += len('[A>]')  # skip over the tag itself
            # search for the end tag only after the start tag
            end_position = each_line.find('[/W]', start_position)
            if end_position != -1:
                answer = each_line[start_position:end_position] + '\n'
                output_file.write(answer)

Let me explain what is happening:

  1. Create an empty string using = ''. This will hold each answer in turn.
  2. Use the with... statement. This allows you to open your file under an alias (I chose open_file). It ensures your file is closed automatically whether or not your program runs correctly.
  3. We use the 'for line in file:' idiom to tackle the file one line at a time. The 'line' variable can be named anything (e.g. for x in file, for pizza in file) and will always contain each line as a string. When it gets to the end of the file, it automatically stops.
  4. The 'each_line.find('[A>]')' call returns the position of the starting tag, or -1 if the tag is not in that line. If it is -1, none of the indented code that follows will run, and the loop will restart, moving to the next line. (Testing 'if each_line.find(...)' directly does not work, because -1 counts as true and position 0 counts as false.)
  5. We use string slicing, where we can cut out the part of the string we want. What we do is search for the first tag by position (which we already know is in this line), then search for the stop tag by position. Once we have those, we can simply cut out the part we want.
  6. I adjusted the positions in two ways. First, I added the length of the tag to the start position so it would skip over the [A>]. Thus instead of giving '[A>] THIS IS MY STRING...' it just gives 'THIS IS MY STRING...'. Second, I searched for the end position by looking for its first occurrence AFTER the [A>] tag, in case the [/W] tag occurs more than once in a line.
  7. We set the answer to the string slice plus a newline character ('\n') so each string appears on its own line. We use the output method .write('stringToWrite') to write each string, one at a time.
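To make the find() and slicing steps concrete, here is a small sketch on an in-memory string. The sample line and the helper name `first_entry` are invented for illustration. It assumes the tag spellings from the question.

```python
line = '[A>]Shakes.[/A] Haml. [/W] more text'
tag = '[A>]'

# find() returns an index, not True/False: 0 is a valid hit, -1 means "absent".
print(line.find(tag))
print('no tags'.find(tag))

def first_entry(line, start_tag='[A>]', end_tag='[/W]'):
    """Return the text between the first start_tag and the next end_tag, or None."""
    start = line.find(start_tag)
    if start == -1:
        return None
    start += len(start_tag)           # skip over the start tag itself
    end = line.find(end_tag, start)   # search only after the start tag
    if end == -1:
        return None
    return line[start:end]

print(first_entry(line))
```

Passing the start position as the second argument to `find()` keeps `end` as an index into the full line, so it can be used directly in the slice.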
chris
  • George, thank you so much for the explanation. I find it very hard to learn without these explanations. If I want to display the results do I "print answerList"? – English Grad Jul 22 '11 at 14:55
  • When I try to run this it gives me this error. Traceback (most recent call last):
    File "C:\Users\Desktop\Search2.py", line 8, in end_position = each_line[start_position:].find('[/W]') MemoryError. After your explanation I understand the code, but is the memory error because the file is too big? I feel like the whole 500 MB text file might be being treated as one line. Is this possible?
    – English Grad Jul 22 '11 at 14:58
  • It's possible that the answerList is growing too large (i.e. too many results). I'll edit the code in a bit to output to another file line by line (busy atm). – chris Jul 22 '11 at 15:03
  • Give it another whirl. Also, if your computer doesn't have a lot of memory, a reboot might free some up. FYI, check out my comment on your original question regarding the nltk book (free online) - you might thumb through it in your spare time. – chris Jul 22 '11 at 16:10
  • George, I am getting the same error message. I think it has to do with the fact that the txt file (546 MB) has no line breaks in it. Would this mean that the program is trying to read the whole file at once? Thanks again for your help. Everyone on here has been so generous already. – English Grad Jul 22 '11 at 17:46
  • This is a sample of code that I am trying to Parse: AAA e&mac.ie&shti., the first letter of the Roman Alphabet, and of its various subsequent modifications (as were its prototypes Alpha of the Greek, and Aleph of the Ph&oe.nician and old Hebrew); I am trying to pull out the tags ... and .... They usually sit next to each other in the file – English Grad Jul 22 '11 at 17:49
  • English Grad - Edit original post to qualify that there are no new lines - that is very important and precisely why reading line by line is not helping. The solution is a bit beyond me but I'll try and figure it out if I have time. @eyquem I was not suggesting nltk for THIS project - but as a general tool of interest for this person (Graduate English Studies). The book assumes no python knowledge and thus would be a good and relevant introduction for English Grad. – chris Jul 22 '11 at 19:59
1

You're getting a memory error with readlines() because, given the file size, you're likely reading in more data than your memory can reasonably handle. Since this file is an XML file, you should be able to read through it with iterparse(), which parses the XML lazily without taking up excess memory. Here's some code I used to parse Wikipedia dumps:

import xml.etree.ElementTree as ET

# Setup assumed here; the original snippet did not show it.
# 'dump.xml' and the empty namespace prefix are placeholders.
parser = ET.iterparse('dump.xml', events=('start', 'end'))
namespace = ''
root = None
count = 0

for event, elem in parser:
    if event == 'start' and root is None:
        root = elem
    elif event == 'end' and elem.tag == namespace + 'title':
        page_title = elem.text
        # This clears bits of the tree we no longer use.
        elem.clear()
    elif event == 'end' and elem.tag == namespace + 'text':
        page_text = elem.text
        # Clear bits of the tree we no longer use
        elem.clear()

        # Now let's grab all of the outgoing links and store them in a list
        key_vals = []


        # Eliminate duplicate outgoing links.
        key_vals = set(key_vals)
        key_vals = list(key_vals)

        count += 1

        if count % 1000 == 0:
            print(str(count) + ' records processed.')
    elif event == 'end' and elem.tag == namespace + 'page':
        root.clear()
Here's roughly how it works:

  1. We create a parser to progress through the document.

  2. As we loop through each element of the document, we look for elements with the tag you are looking for (in your example it was 'A').

  3. We store that data and process it. Any element we are done processing we clear, because as we go through the document it remains in memory, so we want to remove anything we no longer need.
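The three steps above can be sketched on a tiny in-memory document. This is only an illustration under assumptions: the element names (`dict`, `E`, `A`, `W`) and the sample data are invented, and the real dictionary file would have to be well-formed XML for a strict parser like this to accept it.

```python
import io
import xml.etree.ElementTree as ET

# Tiny well-formed stand-in for the dictionary; element names are assumptions.
SAMPLE = (b'<dict>'
          b'<E><A>Shakes.</A><W>Haml.</W></E>'
          b'<E><A>Milton</A><W>P.L.</W></E>'
          b'</dict>')

def authors(source):
    """Yield the text of each <A> element, clearing finished entries as we go."""
    # Step 1: create the parser; 'end' fires once an element is fully read.
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'A':
            # Step 2: this is the tag we are looking for.
            yield elem.text
        elif elem.tag == 'E':
            # Step 3: the entry is finished, so release its children.
            elem.clear()

print(list(authors(io.BytesIO(SAMPLE))))
```

`iterparse` accepts either a filename or a file object, so the same function could be pointed at a file on disk instead of the `BytesIO` sample.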

angusiguess
0

You should look into a tool called "grep". You can give it a pattern to match and a file, and it will print out the matching lines in the file, with line numbers if you want. It is very useful and can probably be interfaced with Python.

Patrick87
  • For instance, grep '^a.*b.*c$' should match whole lines starting with an a, followed by anything, followed by b, followed by anything, ending with c. You should double-check grep regex syntax first, though. – Patrick87 Jul 22 '11 at 13:51
0

Instead of parsing the file by hand why not parse it as XML to have better control of the data? You mentioned that the data is HTML-like so I assume it is parseable as an XML document.

Manny D
0

Please, test the following code:

import re

regx = re.compile('<A>.+?</A>.*?<W>.*?</W>')

with open('/Users/Desktop/2e.txt', 'r') as open_file, \
     open('/Users/Desktop/Poetrylist.txt', 'w') as output_file:

    remain = ''

    while True:
        chunk = open_file.read(65536)  # 65536 == 16 x 16 x 16 x 16
        if not chunk:
            break
        buffer = remain + chunk
        last_end = 0
        for mat in regx.finditer(buffer):
            output_file.write(mat.group() + '\n')
            last_end = mat.end()
        # keep the unmatched tail so a match split across two chunks is still found
        remain = buffer[last_end:]

I couldn't test it because I have no file to test on.
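One way to try the chunked approach without the real file is to run it against a small in-memory sample. The sketch below is not the code above: it swaps the regex for plain `str.find()`, uses the `<A>` and `</W>` tags from the question's working code, and the function name `iter_entries` and the sample text are made up for the example.

```python
import io

def iter_entries(handle, start_tag='<A>', end_tag='</W>', chunk_size=65536):
    """Yield the text between start_tag and the next end_tag, reading in chunks."""
    keep = len(start_tag) - 1  # longest partial start tag a chunk can end with
    buffer = ''
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        while True:
            start = buffer.find(start_tag)
            if start == -1:
                # No start tag yet: keep only a tail that might hold half of one.
                buffer = buffer[-keep:] if keep else ''
                break
            end = buffer.find(end_tag, start + len(start_tag))
            if end == -1:
                # Entry is not complete yet: keep it and wait for more data.
                buffer = buffer[start:]
                break
            yield buffer[start + len(start_tag):end]
            buffer = buffer[end + len(end_tag):]

# Hypothetical sample with no newlines, read in deliberately tiny chunks.
sample = 'xx<A>Shakes.</A><W>Haml.</W>yy<A>Milton</A><W>P.L.</W>zz'
for entry in iter_entries(io.StringIO(sample), chunk_size=5):
    print(entry)
```

Using a very small `chunk_size` in the test forces tags to straddle chunk boundaries, which is the case most likely to go wrong. Memory use stays bounded by the longest single entry rather than by the file size.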

eyquem