
I'm trying to tokenize a large amount of text in Java. When I say large, I mean entire chapters of books at a time. I wrote the first draft of my code using a single page from a book and everything worked fine. Now that I'm trying to process entire chapters, things aren't working. It processes part of the chapter correctly and then it just stops.

Below is all of the relevant code:

File folder = new File(Constants.rawFilePath("eng"));
    FileHelper fileHelper = new FileHelper();
    BPage firstChapter = new BPage();
    BPage firstChapterSpanish = new BPage();
    File[] allFiles = folder.listFiles();
    //read the files into memory
    ArrayList<ArrayList<String>> allPages = new ArrayList<ArrayList<String>>();

    //for the english
    for(int i=0;i<allFiles.length;i++)
    {
        String filePath = Constants.rawFilePath("/eng/metamorph_eng_"+String.valueOf(i)+".txt");
        ArrayList<String> pageToAdd = fileHelper.readFileToMemory(filePath);
        allPages.add(pageToAdd);
    }

    String allPagesAsString = "";

    for(int i=0;i<allPages.size();i++)
    {
        allPagesAsString = allPagesAsString+fileHelper.turnListToString(allPages.get(i));
    }

    firstChapter.setUnTokenizedPage(allPagesAsString);
    firstChapter.tokenize(Languages.ENGLISH);

    folder = new File(Constants.rawFilePath("spa"));
    allFiles = folder.listFiles();
    allPages.clear(); //drop the English pages before collecting the Spanish ones
    //for the spanish
    for(int i=0;i<allFiles.length;i++)
    {
        String filePath = Constants.rawFilePath("/spa/metamorph_spa_"+String.valueOf(i)+".txt");
        ArrayList<String> pageToAdd = fileHelper.readFileToMemory(filePath);
        allPages.add(pageToAdd);
    }

    allPagesAsString = "";

    for(int i=0;i<allPages.size();i++)
    {
        allPagesAsString = allPagesAsString+fileHelper.turnListToString(allPages.get(i));
    }

    firstChapterSpanish.setUnTokenizedPage(allPagesAsString);
    firstChapterSpanish.tokenize(Languages.SPANISH);

    fileHelper.writeFile(firstChapter.getTokenizedPage(), Constants.partiallyprocessedFilePath("eng_ch_1.txt"));
    fileHelper.writeFile(firstChapterSpanish.getTokenizedPage(), Constants.partiallyprocessedFilePath("spa_ch_1.txt"));
}

Even though I'm reading all of the files in the directory where I expect my text to be, only the first couple of files are being added to the string that I'm processing. The code keeps running, but after a certain point it stops adding characters to my string.

What do I have to change so that I can process all of my files at once?

j.jerrod.taylor
    Define "just stops". Do you get an error message? If so, what's the message? (If not, it's probably either a bug in your code, or it hasn't actually stopped but is swapping badly enough to drag performance to a crawl.) What have you done so far to try to diagnose the problem? – keshlam Jan 21 '14 at 20:22
  • Any empty catch blocks? Strongly consider doing some logging with a logging framework. – Hovercraft Full Of Eels Jan 21 '14 at 20:23
    At what point in the code does it "just stop"? It looks like you are doing a lot of String concatenation (`allPagesAsString`), so you might want to replace that with a [`StringBuilder`](http://docs.oracle.com/javase/7/docs/api/java/lang/StringBuilder.html), which is a lot more efficient. Have a look at some of the answers to [this question](http://stackoverflow.com/questions/4645020/when-to-use-stringbuilder-in-java). – andersschuller Jan 21 '14 at 20:24
  • If there's no need to make the read/write operations sequentially I suggest multi-threading – Typo Jan 21 '14 at 20:29
  • @keshlam I don't get an error message. I haven't been able to find a bug. It will process the first 2 or 3 pages without any problem. Even when I step through the code it doesn't look like anything is going wrong and I can step all the way through. – j.jerrod.taylor Jan 21 '14 at 20:31
  • If you can step all the way through, but can't run all the way through, then either you have a multithreading problem (which I don't see in the code you've shown us, but which could be elsewhere) and a timing dependency, or you have a JVM issue. Much more likely the former. – keshlam Jan 21 '14 at 20:39
  • @JuanManuel I'm trying to use [hunalign](http://mokk.bme.hu/en/resources/hunalign/) to do sentence alignment on a book that is written in two different languages. Where a page ends in one language isn't necessarily where it ends in another language. Processing a chapter at a time would be the easiest solution (in my opinion) because chapters start and stop at the same place no matter the language that they are written in. – j.jerrod.taylor Jan 21 '14 at 20:40
  • @keshlam I was assuming that I simply reached the limit for how big strings can be. I'm not doing any multithreading. – j.jerrod.taylor Jan 21 '14 at 20:42
  • Chapters of books is not "really, really large". Tolstoy's _War and Peace_ in text format from Project Gutenberg is only about 3.2 Mb. – Kaz Feb 27 '14 at 18:48

1 Answer


This part

String allPagesAsString = "";

for(int i=0;i<allPages.size();i++)
{
    allPagesAsString = allPagesAsString+
       fileHelper.turnListToString(allPages.get(i));
}

will be really slow if you're copying larger strings, because each `+` allocates a new string and copies everything accumulated so far.

Using a StringBuilder will speed things up a bit:

int expectedBookSize = 10000; //rough capacity hint, avoids repeated resizing
StringBuilder allPagesAsString = new StringBuilder(expectedBookSize);
for(int i=0;i<allPages.size();i++)
{
    allPagesAsString.append(fileHelper.turnListToString(allPages.get(i)));
}

Can't you process one page at a time? That would be the best solution.
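If the whole chapter really does have to sit in memory at once, you can also do the assembly step directly with `java.nio` and a single `StringBuilder`, skipping the intermediate `ArrayList<ArrayList<String>>` entirely. A minimal sketch (the folder layout and `.txt` naming are assumptions based on the question, and `Files.readString` needs Java 11+):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ChapterAssembler {
    // Reads every .txt page in the folder, in sorted file-name order,
    // and appends each one to a single StringBuilder instead of
    // concatenating Strings in a loop.
    public static String assembleChapter(Path folder) throws IOException {
        List<Path> pages;
        try (Stream<Path> files = Files.list(folder)) {
            pages = files.filter(p -> p.toString().endsWith(".txt"))
                         .sorted() // lexicographic: zero-pad page numbers if there are 10+
                         .collect(Collectors.toList());
        }
        StringBuilder chapter = new StringBuilder();
        for (Path page : pages) {
            chapter.append(Files.readString(page));
        }
        return chapter.toString();
    }
}
```

The result can then be handed straight to `setUnTokenizedPage(...)` for each language's folder.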

Ishtar
  • I can't really process one page at a time. I'm tokenizing my string so that I can do sentence alignment on passages of a book that is translated into two different languages. Where a page ends in one language isn't necessarily where it ends in another language, but all chapters start and stop at the same place. – j.jerrod.taylor Jan 21 '14 at 20:34
  • It looks like your suggestion of using StringBuilder instead of String worked. Thanks. – j.jerrod.taylor Jan 21 '14 at 20:52