20

As part of my work, I deal with very large text files and, in part, analyze them for word and phrase frequency. I am running into difficulties with computing time, memory restrictions, and extracting the relevant information.

For this program, I am taking a large text file (say 50MB) which has already been cleaned up and turned into lower case, but is otherwise just unstructured text. I am trying to generate lists of 'bigrams,' 'trigrams,' 'quadgrams,' and 'fivegrams' - respectively, combinations of recurring two-, three-, four-, and five-word phrases (e.g. "i am" is a bigram, "i am free" is a trigram, "i am free always" is a quadgram).

What am I currently doing?

Here is my current code, where inputlower is an all lowercase String (scraped web data w/ Mathematica).

inputlower=Import["/directory/allTextLowered.txt"];
bigrams = 
  Sort[Tally[Partition[inputlower, 2, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/bigrams.txt", bigrams];    
Clear[bigrams];
trigrams = 
  Sort[Tally[Partition[inputlower, 3, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/trigrams.txt", trigrams];
Clear[trigrams];    
quadgrams = 
  Sort[Tally[Partition[inputlower, 4, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/quadrams.txt", quadgrams];
Clear[quadgrams];
fivegrams = 
  Sort[Tally[Partition[inputlower, 5, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/fivegrams.txt", fivegrams];

In a way, it works well: I do get information generated, and on smaller scales I find that this code works fast enough that I can have something approximating a workable Manipulate[] program. But when we're dealing with big inputs...

What is wrong with it when I use big files?

Most importantly, my output files are too big to be usable. Is there a way to specify a cutoff in the code: for example, that I do not want any 'bigrams' that appear only once? If that still leaves too much information, is there a way to specify that I do not want any 'bigrams' in the file unless they appear more than, say, 10 times? e.g. if "my cheese" appears 20 times, I want to know about it, but if "i pad" appears only once, losing it might make the file more manageable.

Secondly, these processes take a long time: generating the bigram output alone took two to three hours. Am I approaching this problem in an efficient manner?

Thirdly, if I did have a large bigram file (~650MB+) that contained ALL of the information, is there a way for Mathematica to access information without loading it all into memory - i.e. taking a file named bigrams.txt, learning that it contains {{"i","am"},55} without bogging the system down?

Edit

[As of 7 Dec 11, I've removed the example file that I've put up - thanks to all again]

canadian_scholar

5 Answers

27

Introduction

What I will propose is different from most suggestions given before, and is based on a combination of indexing, hash-tables, packed arrays, Compress, .mx files and DumpSave, and a few other things. The basic idea is to preprocess the file in a smart way, and save the preprocessed definitions in an .mx file for fast loading. Rather than base most of the work on reading from disk, I propose to shift the emphasis and do most of the work in memory, while finding ways to load data from disk, store it in RAM, work with it, and save it back to disk in a time- and memory-efficient manner. In trying to achieve this goal, I will use most of the efficient Mathematica constructions I am aware of, both for in-memory work and for interaction with the file system.

Code

Here is the code:

Clear[words];
words[text_String] :=  ToLowerCase[StringCases[text, WordCharacter ..]];

(* Rules to replace words with integer indices, and back *)
Clear[makeWordIndexRules];
makeWordIndexRules[sym_Symbol, words : {__String}] :=
   With[{distinctWords = DeleteDuplicates[words]},
    sym["Direct"] = Dispatch[Thread[distinctWords -> Range[Length[distinctWords]]]];
    sym["Inverse"] = Dispatch[Thread[ Range[Length[distinctWords]] -> distinctWords]];
    sym["keys"] = distinctWords;
];

(* Make a symbol with self-uncompressing DownValues / OwnValues *)
ClearAll[defineCompressed];
SetAttributes[defineCompressed, HoldFirst];
defineCompressed[sym_Symbol, valueType_: DownValues] :=
  With[{newVals = 
     valueType[sym] /.
       Verbatim[RuleDelayed][
         hpt : Verbatim[HoldPattern][HoldPattern[pt_]], rhs_] :>
            With[{eval = Compress@rhs}, hpt :> (pt = Uncompress@ eval)]
        },
        ClearAll[sym];
        sym := (ClearAll[sym]; valueType[sym] = newVals; sym)
];


(* Get a list of indices corresponding to a full list of words in a text *)
Clear[getWordsIndices];
getWordsIndices[sym_, words : {__String}] :=
    Developer`ToPackedArray[words /. sym["Direct"]];

(* Compute the combinations and their frequencies *)
Clear[getSortedNgramsAndFreqs];
getSortedNgramsAndFreqs[input_List, n_Integer] :=
   Reverse[#[[Ordering[#[[All, 2]]]]]] &@ Tally[Partition[input, n, 1]];


(* 
 ** Produce n-grams and store them in a hash-table. We split combinations from
 ** their frequencies, and assume indices for input, to utilize packed arrays 
 *)
Clear[produceIndexedNgrams];
produceIndexedNgrams[sym_Symbol, input_List, range : {__Integer}] :=
 Do[
   With[{ngramsAndFreqs = getSortedNgramsAndFreqs[input, i]},
     sym["NGrams", i] = Developer`ToPackedArray[ngramsAndFreqs[[All, 1]]];
     sym["Frequencies", i] =  Developer`ToPackedArray[ngramsAndFreqs[[All, 2]]]
   ],
   {i, range}];


(* Higher-level function to preprocess the text and populate the hash-tables *)
ClearAll[preprocess];
SetAttributes[preprocess, HoldRest];
preprocess[text_String, inputWordList_Symbol, wordIndexRuleSym_Symbol,
    ngramsSym_Symbol, nrange_] /; MatchQ[nrange, {__Integer}] :=
  Module[{},
    Clear[inputWordList, wordIndexRuleSym, ngramsSym];
    inputWordList = words@text;
    makeWordIndexRules[wordIndexRuleSym, inputWordList];
    produceIndexedNgrams[ngramsSym,
    getWordsIndices[wordIndexRuleSym, inputWordList], nrange]
  ];

(* Higher-level function to make the definitions auto-uncompressing and save them *)
ClearAll[saveCompressed];
SetAttributes[saveCompressed, HoldRest];
saveCompressed[filename_String, inputWordList_Symbol, wordIndexRuleSym_Symbol, 
   ngramsSym_Symbol] :=
Module[{},
   defineCompressed /@ {wordIndexRuleSym, ngramsSym};
   defineCompressed[inputWordList, OwnValues];
   DumpSave[filename, {inputWordList, wordIndexRuleSym, ngramsSym}];
];

The above functionality is very memory-hungry: processing @Ian's file took nearly 5 GB of RAM at one point. However, this is worth it, and one can also test the above with smaller files if there is not enough RAM. Generally, large files could be split into several parts to work around this problem.

Preprocessing and saving in optimized format

We now start. Preprocessing takes about a minute on my machine:

test = Import["C:\\Temp\\lowered-text-50.txt", "Text"];

In[64]:= preprocess[test,inputlower,wordIndexRules,ngrams,{2,3}];//Timing
Out[64]= {55.895,Null}

The symbols inputlower, wordIndexRules, and ngrams are the names I chose for the list of words in the file and for the hash-tables. Here are some additional inputs illustrating how these symbols are used and what they mean:

In[65]:= ByteCount[inputlower]
Out[65]= 459617456

In[69]:= inputlower[[1000;;1010]]
Out[69]= {le,fort,edmonton,le,principal,entrepôt,de,la,compagnie,de,la}

In[67]:= toNumbers = inputlower[[1000;;1010]]/.wordIndexRules["Direct"]
Out[67]= {58,220,28,58,392,393,25,1,216,25,1}

In[68]:= toWords =toNumbers/. wordIndexRules["Inverse"]
Out[68]= {le,fort,edmonton,le,principal,entrepôt,de,la,compagnie,de,la}

In[70]:= {ngrams["NGrams",2],ngrams["Frequencies",2]}//Short
Out[70]//Short= {{{793,791},{25,1},{4704,791},<<2079937>>,{79,80},{77,78},{33,34}},{<<1>>}}

The main idea here is that we use integer indices instead of words (strings), which allows us to utilize packed arrays for n-grams.
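
As a small illustration of the gain (not from the original session; the data here is random and the variable names are made up), compare the memory footprint of pairs of strings with that of packed pairs of integer indices:

(* Random "word" pairs as strings vs. random integer index pairs *)
stringPairs = Partition[RandomChoice[{"la", "de", "le", "fort", "edmonton"}, 10^6], 2, 1];
indexPairs = Developer`ToPackedArray[Partition[RandomInteger[{1, 10^5}, 10^6], 2, 1]];
ByteCount /@ {stringPairs, indexPairs}
(* the packed integer version is typically an order of magnitude smaller *)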

Compressing and saving takes another half a minute:

In[71]:= saveCompressed["C:\\Temp\\largeTextInfo.mx", inputlower, 
      wordIndexRules, ngrams] // Timing
Out[71]= {30.405, Null}

The resulting .mx file is about 63MB, which is about the size of the original file. Note that since a part of what we save is a (self-compressed) variable inputlower, which contains all the input words in the original order, we are not losing any information as compared to the original file. In principle, one can from now on start working with the new .mx file only.

Working with the optimized file

We now start a new session by quitting the kernel. It takes almost no time to load the file (.mx format is extremely efficient):

In[1]:= Get["C:\\Temp\\largeTextInfo.mx"] // Timing
Out[1]= {0.016, Null}

Loading the list of words takes some time (self-uncompressing):

In[2]:= inputlower//Short//Timing
Out[2]= {6.52,{la,présente,collection,numérisée,<<8000557>>,quicktime,3,0}}

but we don't use it for anything - it is stored just in case. Loading the 2-grams and their frequencies:

In[3]:= Timing[Short[ngrams2 = {ngrams["NGrams",2],ngrams["Frequencies",2]}]]
Out[3]= {0.639,{{{793,791},{25,1},{4704,791},<<2079937>>,{79,80},{77,78},{33,34}},{<<1>>}}}

Note that most of the time here was spent on self-uncompressing, which is selective (so, e.g., ngrams["NGrams",3] is still compressed). Loading the 3-grams and their frequencies:

In[4]:= Timing[Short[ngrams3 = {ngrams["NGrams",3],ngrams["Frequencies",3]}]]
Out[4]= {1.357,{{{11333,793,11334},{793,11334,11356},<<4642628>>,{18,21,22},{20,18,21}},{<<1>>}}}

The timings are decent, given the size of the lists. Note that neither DumpSave / Get nor Compress / Uncompress unpacks packed arrays, so our data is stored in Mathematica's memory pretty efficiently:

In[5]:= Developer`PackedArrayQ/@ngrams3
Out[5]= {True,True}

Here we uncompress the rules relating indices to words:

In[6]:= Timing[Short[wordIndexRules["Inverse"]]]
Out[6]= {0.905,Dispatch[{1->la,2->présente,<<160350>>,160353->7631,160354->jomac},-<<14>>-]}

This is enough to start working with the data, but in the next section I outline some hints on how to make this work yet more efficient.

Efficient work with uncompressed data

If we try to find, for example, all positions of 3-grams with frequency 1, the naive way is this:

In[8]:= Position[ngrams["Frequencies",3],1,{1}]//Short//Timing
Out[8]= {1.404,{{870044},{870045},{870046},<<3772583>>,{4642630},{4642631},{4642632}}}

However, we can leverage the fact that we work with integer indices (instead of words) stored in a packed array. Here is one version of the custom position function (due to Norbert Pozar):

extractPositionFromSparseArray[HoldPattern[SparseArray[u___]]] := {u}[[4, 2, 2]]; 
positionExtr[x_List, n_] := 
    extractPositionFromSparseArray[SparseArray[Unitize[x - n], Automatic, 1]]

Using this, we get the result about 10 times faster (one can use a function compiled to C, which is about twice as fast again; a sketch is given below):

In[9]:= positionExtr[ngrams["Frequencies",3],1]//Short//Timing
Out[9]= {0.156,{{870044},{870045},{870046},<<3772583>>,{4642630},{4642631},{4642632}}}
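
For reference, here is one possible form of the compiled-to-C variant mentioned above. This is only a sketch: the name cPositions is made up, it returns a flat list of positions rather than the nested form Position gives, and CompilationTarget -> "C" needs a C compiler installed (omit that option to fall back to the virtual machine):

cPositions = Compile[{{lst, _Integer, 1}, {val, _Integer}},
   Module[{n = 0, res = Table[0, {Length[lst]}]},
     (* collect matching positions into a preallocated buffer *)
     Do[If[lst[[i]] == val, n = n + 1; res[[n]] = i], {i, Length[lst]}];
     Take[res, n]],
   CompilationTarget -> "C", RuntimeOptions -> "Speed"];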

Here are some more convenience functions:

Clear[getNGramsWithFrequency];
getNGramsWithFrequency[ngramSym_Symbol, n_Integer, freq_Integer] :=
  Extract[ngramSym["NGrams", n], positionExtr[ngramSym["Frequencies", n], freq]];

Clear[deleteNGramsWithFrequency];
deleteNGramsWithFrequency[{ngrams_List, freqs_List}, freq_Integer] :=  
    Delete[#, positionExtr[freqs, freq]] & /@ {ngrams, freqs};

deleteNGramsWithFrequency[ngramSym_Symbol, n_Integer, freq_Integer] :=  
   deleteNGramsWithFrequency[{ngramSym["NGrams", n], ngramSym["Frequencies", n]}, freq];

Using these, we can do many things pretty efficiently. For example, delete the 2-grams with frequency 1:

In[15]:= deleteNGramsWithFrequency[ngrams,2,1]//Short//Timing
Out[15]= {0.218,{{{793,791},{25,1},{4704,791},<<696333>>,{29,66},{36,37},{18,21}},{<<1>>}}}

Or, remove the 2-grams with frequencies of 100 or less (this is a sub-optimal way to do it, but it is still pretty fast):

In[17]:= (twogramsLarger100 = 
  Fold[deleteNGramsWithFrequency,deleteNGramsWithFrequency[ngrams,2,1],Range[2,100]])
  //Short//Timing
Out[17]= {0.344,{{{793,791},{25,1},{4704,791},{25,10},<<6909>>,
  {31,623},{402,13},{234,25}},{<<1>>}}}

The main idea is that integer indices play a role of "pointers" for words, and most things can be done with them. When needed, we can return to the normal words:

In[18]:= twogramsLarger100/.wordIndexRules["Inverse"]//Short//Timing
Out[18]= {0.063,{{{of,the},{de,la},<<6912>>,{société,du},{processus,de}},{<<1>>}}}

Concluding remarks

The speed-up achieved here seems substantial. One can control how much RAM is occupied by the data, by loading data in fine-grained chunks. The memory usage itself has been hugely optimized by utilizing packed arrays. The memory savings on disk are due to the combination of Compress and DumpSave. Hash-tables, Dispatch-ed rules and self-uncompressing are techniques used to make it more convenient.

There is plenty of room here for further refinements. One can split the data into smaller chunks and compress / save those separately, to avoid high memory usage in intermediate steps. One can also split the data according to frequency ranges and save the resulting pieces into separate files, to speed up the load / self-uncompress stage. For many files, one would need to generalize this, since global symbols for the hashes were used here. This seems a good place to apply some OOP techniques. Generally, this is just a starting point, but my message is that this approach IMO has good potential for efficient work with such files.
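
To make the frequency-range idea concrete, here is a rough sketch, not part of the code above: the helper name saveFrequencyBand and the file path are made up, and it relies on Export supporting the "MX" format, as mentioned in the comments below.

Clear[saveFrequencyBand];
saveFrequencyBand[file_String, ngramSym_Symbol, n_Integer, {min_Integer, max_Integer}] :=
  Module[{pos},
    (* positions of n-grams whose frequency falls in the given band *)
    pos = Flatten @ Position[ngramSym["Frequencies", n], f_ /; min <= f <= max, {1}];
    Export[file, {ngramSym["NGrams", n][[pos]], ngramSym["Frequencies", n][[pos]]}, "MX"]
  ];

(* e.g., put the rare 2-grams (frequencies 1 through 10) into a separate file *)
saveFrequencyBand["C:\\Temp\\2grams-freq-1-10.mx", ngrams, 2, {1, 10}];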

Leonid Shifrin
  • Wow, this looks very powerful - thank you. I'll look forward to exploring this line-by-line and learning some new commands! – canadian_scholar Nov 24 '11 at 00:18
  • `defineCompressed` is bloody brilliant. – Mr.Wizard Nov 24 '11 at 08:37
  • Sometimes it's good to use WDX in place of MX for portability. I'm not sure how much slower WDX is and whether it is compressed internally. – Szabolcs Nov 24 '11 at 08:59
  • @Szabolcs Thanks, a good suggestion. I never tried WDX, this looks like a good opportunity to test. Portability is always good, although for the case at hand it is not clear how critical this is to Ian. – Leonid Shifrin Nov 24 '11 at 09:44
  • As of now, portability isn't that critical but could arise in the future, so very handy to know @Szabolcs – canadian_scholar Nov 24 '11 at 14:15
  • @Ian Unfortunately, in a quick test, WDX export seems to be both slow and memory hungry. MX will always be the fastest, but it may be incompatible between Mma versions and the same Mma versions on different platforms (32/64 bit). One thing I'd like to point out is that when you use MX with `DumpSave`/`Get`, it saves a *variable*, and the *same* variable name will be restored upon import. With `Import`/`Export` you save the *data*, and you can name it whatever you want on import (safer). Fortunately, you *can* use the MX format with `Import`/`Export`. – Szabolcs Nov 24 '11 at 14:56
  • @Ian And you can even do `Export["data.mx.gz", data]` then `data = Import["data.mx.gz"]` which seamlessly integrates the compression into import/export with no extra work. – Szabolcs Nov 24 '11 at 14:57
  • @Szabolcs Great to know - I actually haven't used the MX format before, so am very happy to work with it. I've had trouble keeping data in a consistent format between sessions, so I'm looking forward to using it instead of doing everything w/ TXT files. – canadian_scholar Nov 24 '11 at 14:59
  • @Szabolcs I generally agree with the remark on safety. I mentioned, although did not have time to implement it, that for the case of several files, some OOP-style encapsulation mechanism can be used to make the process safe. In particular, the symbols would be generated and guaranteed to be unique (across Mathematica sessions), and would be hidden behind the interface functions. This is quite possible to do. We probably need to represent state here anyway, and I don't see a way around it if we want to be able to work with several (many) files simultaneously in a time- and memory-efficient way. – Leonid Shifrin Nov 24 '11 at 15:07
  • @Leonid But there's no need to implement it ourselves, since as I mentioned we can just use `Import`/`Export` with the MX format (as I discovered today!) – Szabolcs Nov 24 '11 at 15:12
  • @Szabolcs I meant that, generally, I make use of symbols through things like self-uncompressing `DownValues` and `OwnValues`, which makes it easy to build a data structure with rather complex behavior around those. You can certainly deconstruct those, and instead construct a large Mathematica expression containing those pieces as its parts. However, then you have to constantly watch that your accessor functions are up to date with your file format. In other words, you remove some risks but IMO increase the complexity in other places in the program. My idea is that some OO layer is implemented ... – Leonid Shifrin Nov 24 '11 at 15:19
  • @Szabolcs ... on top of the code above, to hide the symbols and effectively serialize the entire object incorporating both state (symbols) and behavior, and providing some easy-to-use interface to the user. In that case, loading the .mx file would ideally deserialize and reconstruct the object *and* its interface (since you can put functions in .mx file too). In this way, there is no danger that the file content would ever go out of sync with the interface - may be, some new functions won't be supported, but that will at least be obvious. If I have some more time soon, I hope to illustrate it. – Leonid Shifrin Nov 24 '11 at 15:23
  • @Szabolcs But this of course does not detract from your very nice observation about `Import` and .mx files! – Leonid Shifrin Nov 24 '11 at 15:25
7

These slides are the current best wisdom on how to deal with importing and working with large sets of data:

http://library.wolfram.com/infocenter/Conferences/8025/

They cover some of the subjects mentioned here and give some graphs showing how much of a speed-up you can expect from switching away from Import.

Searke
6

These are my suggestions:

  1. I suggest using ReadList[file, Word]. Usually it is much faster than Import. This will also break it into words.

  2. You can consider working with gzip-compressed files too. Import and Export support these seamlessly, but ReadList does not. For disk-limited operations this will actually be faster than reading/writing uncompressed data (see the sketch after this list).

  3. Your Sorts may be slow (I haven't tested your operations with large files, so I'm not sure). See yesterday's question on how to do this fast.
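
As a minimal sketch of point 2 (the file names are placeholders): Import and Export recognize the .gz extension and compress/decompress transparently, so one can read gzipped input and save gzipped results without any extra steps, much like the .mx.gz idiom mentioned in the comments on the first answer.

(* gzipped text input; Import recognizes the .gz extension *)
inputlower = Import["/directory/allTextLowered.txt.gz"];

(* save intermediate results compressed; Import restores them as-is *)
Export["/directory/bigrams.mx.gz", bigrams];
bigrams = Import["/directory/bigrams.mx.gz"];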

You can't break out from Tally before it finishes, but you can always use Select, Cases or DeleteCases to trim the bigram list before exporting.
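
For example, since Tally returns {n-gram, count} pairs, a minimal sketch of that trimming step (the threshold of 10 and the names bigrams and frequentBigrams are placeholders) could be either of these:

frequentBigrams = Select[bigrams, #[[2]] > 10 &];
(* or, equivalently, drop everything at or below the threshold *)
frequentBigrams = DeleteCases[bigrams, {_, c_ /; c <= 10}];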

Finally, as an answer to your last question: I am afraid Mathematica is only efficient/convenient to work with if you load all data into memory. The system seems to be conceived to work well with in-memory data only. This is from personal experience.


EDIT Working with your 50 MB text file is slow, but still bearable on my (rather old and slow) machine. Just make sure you use SortBy:

In[1]:= $HistoryLength = 0; (* save memory *)

In[2]:= Timing[
 data = ReadList["~/Downloads/lowered-text-50.txt", Word, 
    WordSeparators -> {" ", "\t", ".", ","}];]

Out[2]= {6.10038, Null}

In[3]:= Timing[counts = Tally@Partition[data, 2, 1];]

Out[3]= {87.3695, Null}

In[4]:= Timing[counts = SortBy[counts, Last];]

Out[4]= {28.7538, Null}

In[5]:= Timing[counts = DeleteCases[counts, {_, 1}];]

Out[5]= {3.11619, Null}

I couldn't get ReadList to handle UTF-8 properly, so you may need to stick to Import there.

Szabolcs
  • These are great suggestions. I'll try using the `ReadList` command instead for these operations - and with compressed files as well. Re: memory, at some point I hope to upgrade to a better system (to really take advantage of Mathematica's parallel powers, too). – canadian_scholar Nov 23 '11 at 20:17
  • I wish I could accept more than one answer here, as these are all really wonderful and well thought out solutions. My sincerest thanks again. – canadian_scholar Nov 24 '11 at 14:16
5

To extend a comment I made, Read is a useful alternative to ReadList or Import. Like ReadList, you specify a type, and if you specify String it reads in an entire line. So, you can process the file one (or more) lines at a time. The only difficulty is that you have to watch for EndOfFile by hand. For instance,

strm = OpenRead[file];
While[ (line = Read[ strm, String ]) =!= EndOfFile,
 (* Do something with the line *)
];
Close[ strm ];

To extend this to multiple lines at once, replace String above with a list containing String repeated as many times as the number of lines you wish to process at once; for more than a few lines this is best constructed with ConstantArray[String, n] (see the sketch below). Of course, Word can be used instead, to process the file on a word-by-word basis.
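
Here is a minimal sketch of that multi-line variant. The batch size of 1000 and the helper name processBatch are placeholders, and the final batch is filtered defensively because Read may pad it with EndOfFile when the file runs out mid-batch:

strm = OpenRead[file];
While[
  batch = DeleteCases[Flatten[{Read[strm, ConstantArray[String, 1000]]}], EndOfFile];
  batch =!= {},
  processBatch[batch]  (* placeholder: do something with up to 1000 lines *)
];
Close[strm];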

Processing files line-by-line has a downside: if you need to Abort the process, strm will be left open. So, I suggest wrapping the code in CheckAbort, or using the functionality described here.
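
A minimal sketch of the CheckAbort route (processLine is a placeholder for the actual work): close the stream before passing the abort along.

strm = OpenRead[file];
CheckAbort[
  While[(line = Read[strm, String]) =!= EndOfFile,
    processLine[line]  (* placeholder for whatever you do with the line *)
  ],
  Close[strm]; Abort[]  (* on abort: clean up, then re-raise the abort *)
];
Close[strm];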

rcollyer
3

You might look at "String Patterns", which are Mathematica's version of Regular Expressions. Perhaps something like StringCases[data, RegularExpression["\\w+?\\W+?\\w+?"]], which should return all the matching sequences of word-whitespace-word. I can't say whether this will be faster than your partition code or not.

There's a "Tips For Efficient Matching" at the bottom of that page.

You could apply "DeleteDuplicates" to trim the list before sorting.

If I were doing this in another language, I would store the n-grams in a Hash Table, with the text as key, and the instance count as a value. This would work well with a line-by line file parser. However, it seems using a Hash Table in Mathematica is not straightforward.
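
That said, indexed DownValues are the usual stand-in for a hash table in Mathematica. Here is a minimal sketch of the counting idea, assuming wordList holds the list of words; note that for large data, Tally as in the question will generally be much faster:

ClearAll[count];
count[_] = 0;                                    (* default count for unseen keys *)
Scan[(count[#] += 1) &, Partition[wordList, 2, 1]];

count[{"i", "am"}]                               (* look up the count of a bigram *)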

And one observation: Instead of running 4 passes, you could just generate all the 5-grams to a single file, and use simple command line text processing to generate the 2-, 3- and 4-grams from that file. Of course, this is only useful after you get the 5-gram extractor to run in a reasonable time.
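
If you prefer to stay inside Mathematica rather than use command-line tools, here is a rough sketch of that idea; the name lowerOrderFromFivegrams is made up, and because the last few positions of the text never start a 5-gram, the derived counts are only approximate near the end of the corpus:

Clear[lowerOrderFromFivegrams];
lowerOrderFromFivegrams[fivegramTally_List, n_Integer] :=
  SortBy[
    (* take each 5-gram's first n words, then re-aggregate the counts *)
    {#[[1, 1]], Total[#[[All, 2]]]} & /@
      GatherBy[{Take[#[[1]], n], #[[2]]} & /@ fivegramTally, First],
    -Last[#] &];

(* e.g., approximate bigram counts from an existing 5-gram tally *)
lowerOrderFromFivegrams[fivegrams, 2]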

AShelly
  • See Also ["Performance Tuning in Mathematica"](http://stackoverflow.com/q/4721171/10396) – AShelly Nov 23 '11 at 19:47
  • This is great, thank you. I will give `StringCases` a try and compare them with `Timing` - I'm curious to see what method works best. Certainly, `StringCases[]` seems very versatile.. – canadian_scholar Nov 23 '11 at 19:50
  • One more link: http://stackoverflow.com/questions/7525782/import-big-files-arrays-with-mathematica – AShelly Nov 23 '11 at 20:13