13

I'm working with a very large body of text (~290MB of plain text in one file). After importing it into Mathematica 8, I'm breaking it down into lowercase words, etc., so I can begin textual analysis.

The problem is that these processes take a long time. Is there a way to monitor these operations from within Mathematica? For operations driven by an iteration variable, I've used `ProgressIndicator`, etc., but this is different. My searching of the documentation and StackOverflow has not turned up anything similar.

In the following, I would like to monitor the progress of the `Cases[]` command:

input=Import["/users/USER/alltext.txt"];
wordList=Cases[StringSplit[ToLowerCase[input],Except[WordCharacter]],Except[""]];
canadian_scholar
  • I wonder if your question is about monitoring the `Cases[]` progress, or about optimizing your code. They are two [entirely unlike](http://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy#Not_entirely_unlike) problems – Dr. belisarius Oct 18 '11 at 12:38
  • @belisarius Almost, but not entirely.. I gather from the responses that my need/request to monitor `Cases[]` stems from some slower choices in my code. Also, perhaps there is no readily apparent way to monitor such progress.. – canadian_scholar Oct 18 '11 at 12:56

4 Answers

11

Something like `StringCases[ToLowerCase[input], WordCharacter..]` seems to be a little faster. And I would probably use `DeleteCases[expr, ""]` instead of `Cases[expr, Except[""]]`.
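
As a rough sketch, the two approaches can be compared on a short made-up string. The name `sample` and its contents are assumptions for illustration only, not the original file:

```mathematica
(* Hypothetical sample text standing in for the large file *)
sample = "The quick brown fox -- jumps over the lazy dog!";

(* Approach from the question: split on every non-word character,
   then drop the empty strings left between adjacent separators *)
viaSplit = DeleteCases[
   StringSplit[ToLowerCase[sample], Except[WordCharacter]], ""];

(* Suggested approach: extract runs of word characters directly *)
viaStringCases = StringCases[ToLowerCase[sample], WordCharacter ..];

(* Check that both approaches produce the same word list *)
viaSplit === viaStringCases
```

Wrapping each assignment in `AbsoluteTiming` on a slice of the real data would show whether the speed difference holds for a given text.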

Joshua Martell
10

It is possible to view the progress of the StringSplit and Cases operations by injecting "counter" operations into the patterns being matched. The following code temporarily shows two progress bars: the first showing the number of characters processed by StringSplit and the second showing the number of words processed by Cases:

input = ExampleData[{"Text", "PrideAndPrejudice"}];

wordList =
  Module[{charCount = 0, wordCount = 0, allWords}
  , PrintTemporary[
      Row[
        { "Characters: "
        , ProgressIndicator[Dynamic[charCount], {0, StringLength@input}]
        }]]

  ; allWords = StringSplit[
        ToLowerCase[input]
      , (_ /; (++charCount; False)) | Except[WordCharacter]
      ]

  ; PrintTemporary[
      Row[
        { "Words:      "
        , ProgressIndicator[Dynamic[wordCount], {0, Length@allWords}]
        }]]

  ; Cases[allWords, (_ /; (++wordCount; False)) | Except[""]]

  ]

The key to the technique is that the patterns used in both cases match against the wildcard _. However, that wildcard is guarded by a condition that always fails -- but not until it has incremented a counter as a side effect. The "real" match condition is then processed as an alternative.
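
To make the difference concrete, here is a small sketch on a made-up list (`words`, `countAll`, and `countMatched` are hypothetical names). It contrasts the always-failing guard with a condition attached directly to the real pattern, which is evaluated only for elements that already match:

```mathematica
(* Hypothetical toy input: three words and two empty strings *)
words = {"a", "", "b", "", "c"};

(* Guarded wildcard: the condition increments the counter and then fails,
   so it is evaluated for every element before the real alternative is tried *)
countAll = 0;
r1 = Cases[words, (_ /; (++countAll; False)) | Except[""]];

(* Condition on the real pattern: evaluated only when Except[""] matches,
   so the empty strings are never counted *)
countMatched = 0;
r2 = Cases[words, Except[""] /; (++countMatched; True)];

{r1, countAll, r2, countMatched}
```

Both calls return the same filtered list; only the first counter reflects the total number of elements, which is what a progress bar with a fixed upper bound needs.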

WReach
  • +1 just be aware that on my machine this is 7 times slower than the same non-monitored code – Dr. belisarius Oct 18 '11 at 14:12
  • Very clever! +1. You can simply use `Except[WordCharacter]/;(++charCount;True)` and `Except[""] /; (++wordCount; True)` instead of `(_ /; (++charCount; False)) | Except[WordCharacter]` and `(_ /; (++wordCount; False)) | Except[""]` with the same success but with more efficiency. Usage of `DeleteCases` instead of `Cases` may give even more speedup as Joshua Martell [points out](http://stackoverflow.com/questions/7801897/monitoring-process-of-cases-on-a-very-large-body-of-information/7802280#7802280). – Alexey Popkov Oct 18 '11 at 14:24
  • @Alexey That is what I tried at first, but it did not count all characters and words -- only those that matched the pattern. – WReach Oct 18 '11 at 14:26
  • Addition: using `/; NumberQ[++charCount]` and `/; NumberQ[++wordCount]` gives a little more speedup and shorter code. – Alexey Popkov Oct 18 '11 at 14:37
  • @WReach Now I understand what you mean. Interesting. – Alexey Popkov Oct 18 '11 at 14:48
  • Fantastic - let me give this a try as well. This is a really interesting approach that has broad application. – canadian_scholar Oct 18 '11 at 15:01
5

It depends a little on what your text looks like, but you could try splitting the text into chunks and iterating over those. You could then track the iterator with `Monitor` to see the progress. For example, if your text consists of lines terminated by a newline, you could do something like this:

Module[{list, t = 0},
 list = ReadList["/users/USER/alltext.txt", "String"];
 Monitor[wordlist = 
   Flatten@Table[
     StringCases[ToLowerCase[list[[t]]], WordCharacter ..], 
      {t, Length[list]}], 
  Labeled[ProgressIndicator[t/Length[list]], N@t/Length[list], Right]];
 Print["Ready"]] 

On a file of about 3 MB this took only marginally more time than Joshua's suggestion.
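
The same idea can be tried without the file on disk. The following is only a sketch: `lines` is a made-up in-memory list standing in for the output of `ReadList`, and `i` and `result` are hypothetical names:

```mathematica
(* Hypothetical stand-in for ReadList[..., "String"] *)
lines = {"First line of text.", "Second LINE, with punctuation!", "Third"};

chunkedWords = Module[{i = 0, result},
   (* Monitor displays the progress bar while Table runs;
      i is the index of the chunk currently being processed *)
   result = Monitor[
     Flatten@Table[
       StringCases[ToLowerCase[lines[[i]]], WordCharacter ..],
       {i, Length[lines]}],
     ProgressIndicator[i, {0, Length[lines]}]];
   result]
```

With only three short lines the indicator disappears almost immediately; the display only becomes useful when each chunk takes noticeable time.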

Heike
4

I don't know how `Cases` works internally, but list processing can be time-consuming, especially if the list is built up element by element. Since the number of matching terms in the processed expression is not known in advance, that is likely what is happening with `Cases`. So, I'd try something slightly different: replacing `""` with `Sequence[]`. For instance, this list

{"5", "6", "7", Sequence[]}

becomes

{"5", "6", "7"}.

So, try

bigList /. "" -> Sequence[]

It should run faster, as it is not building up a large list from nothing.
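
A minimal sketch of the replacement on a made-up list, checked against `DeleteCases` (the contents of `bigList` here are an assumption for illustration):

```mathematica
(* Hypothetical small list with empty strings mixed in *)
bigList = {"5", "", "6", "", "7"};

(* Each "" is replaced by Sequence[], which splices nothing into the list *)
viaSequence = bigList /. "" -> Sequence[];

(* Reference result using DeleteCases *)
viaDeleteCases = DeleteCases[bigList, ""];

viaSequence === viaDeleteCases
```

Whether this is actually faster than `Cases` or `DeleteCases` is untested here; timing both with `AbsoluteTiming` on real data would settle it.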

rcollyer
  • This is an excellent suggestion - I will try implementing it. Code efficiency is the root problem here! – canadian_scholar Oct 18 '11 at 03:30
  • @rcollyer I wouldn't worry about the internal list-building happening in `Cases`. It surely is optimized for list-building and is free from the `AppendTo` syndrome (quadratic list-building complexity). It is in fact somewhat *more* efficient than the method with `Sequence`. – Leonid Shifrin Oct 18 '11 at 03:32
  • @Leonid, I've had trouble with built-in functions in the past usually involving list generation. (Unfortunately, no specific example comes to mind.) And, I'll admit, I did not test this. I was merely offering a possible alternative. – rcollyer Oct 18 '11 at 03:36
  • @ian.milligan The real efficiency gains will likely lie in avoiding using Mathematica's patterns for text manipulations for "as long as possible", but using string patterns, regular expressions, etc. Keep in mind that many string-processing functions like `StringCases` also work on lists of strings and are very fast. – Leonid Shifrin Oct 18 '11 at 03:36
  • @rcollyer Sure, alternatives are always good. I just wanted to point out that `Cases` does not suffer from this particular deficiency. – Leonid Shifrin Oct 18 '11 at 03:38
  • @Leonid, not a problem. I rarely use `Cases`, and my last large data set, I shrunk via `SparseArray` (mostly 0s in an 80^3 array). – rcollyer Oct 18 '11 at 04:14
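
The point made in the comments above, that string functions such as `StringCases` also accept a list of strings, can be sketched as follows. The list `lines` is made up for illustration:

```mathematica
(* Hypothetical list of lines, e.g. as returned by ReadList[..., "String"] *)
lines = {"Hello World", "Foo, BAR baz"};

(* ToLowerCase threads over the list, and StringCases accepts a list of
   strings, returning one list of matches per input string *)
perLine = StringCases[ToLowerCase[lines], WordCharacter ..];

(* Flatten the per-line results into a single word list *)
allWords = Flatten[perLine]
```

This keeps the work inside the string-pattern functions and avoids a separate `Cases` pass over the words.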