3

What are some generic methods for optimizing a program in Java, in terms of speed. I am using a DOM Parser to parse an XML file and then store certain words in an ArrayList, remove any duplicates then spell check those words by creating Google search URL's for each word, get the html document, locate the corrected word and save it to another ArrayList.

Any help would be appreciated! Thanks.

TookTheRook
  • 817
  • 4
  • 14
  • 31

4 Answers4

3

SAX is faster than DOM. If you don't want to go through the ArrayList searching for duplicates, put everything in a LinkedHashMap -- no duplicates, and you still get the order-of-insertion that ArrayList gives you.

But the real bottleneck is going to be sending the HTTP request to Google, waiting for the response, then parsing the response. Use a spellcheck library, instead.

Edit: But take my educated guesses with a grain of salt. Use a code profiler to see what's really slowing down your program.

Nick
  • 2,827
  • 4
  • 29
  • 39
  • The files themselves aren't that big so I don't think using SAX would make that much of a difference. But yeah, I need to find ways to reduce the number of HTTP requests to Google. Thanks for your feedback! – TookTheRook Nov 17 '10 at 04:21
3

Why do you need to improve performance? From your explanation, it is pretty obvious that the big bottleneck here (or performance hit) is going to be the IO resulting from the fact that you are accessing a URL.

This will surely dwarf by orders of magnitude any minor improvements you make in data structures or XML frameworks.

It is a good general rule of thumb that your big performance problems will involve IO. Humorously enough, I am at this very moment waiting for a database query to return in a batch process. It has been running for almost an hour. But I welcome any suggested improvements to my XML parsing library nevertheless!

Here are my general methods:

  • Does your program perform any obviously expensive task from the perspective of latency (IO)? Do you have enough logging to see that this is where the delay is (if significant)?

  • Is your program prone to lock-contention (i.e. can it wait around, doing nothing, waiting for some resource to be "free")? Perhaps you are locking an entire Map whilst you make an expensive calculation for a value to store, blocking other threads from accessing the map

  • Is there some obvious algorithm (perhaps for data-matching, or sorting) that might have poor characteristics?

  • Run up a profiler (e.g. jvisualvm, which ships with the JDK itself) and look at your code hotspots. Where is the JVM spending its time?

oxbow_lakes
  • 133,303
  • 56
  • 317
  • 449
  • So the generic method here is: Profile the code to see where the most time is spent. You are predicting (possibly correctly) the result of that and reaching a (possibly correct) conclusion of: Nothing to do :-) This depends on the size of the original XML file and where it comes from. – phkahler Nov 16 '10 at 22:07
  • Haha - indeed I am. Based on many years of experience. My DB query has returned now! – oxbow_lakes Nov 16 '10 at 22:22
  • Yeah, the major slow down is because of the HTTP requests. Since some of the words are repeated, I think using a set to first check if the word has already been queried should improve the speed. Not much else to do. I would use SAX Parser but the individual XML files by themselves are not that big. Thank you for your input oxbow! – TookTheRook Nov 17 '10 at 04:20
1

Generally the best method is to figure out where your bottleneck is, and fix it. You'll usually find that you spend 90% of your time in a small portion of your code, and that's where you want to focus your efforts.

Once you've figured out what's taking a lot of time, focus on improving your algorithms. For example, removing duplicates from an ArrayList can be O(n²) complexity if you're using the most obvious algorithm, but that can be reduced to O(n) if you leverage the correct data structures.

Once you've figured out which portions of your code are taking the most time, and you can't figure out how best to fix it, I'd suggest narrowing down your question and posting another question here on StackOverflow.

Edit

As @oxbow_lakes so snidely put it, not all performance bottlenecks are to be found in the code's big-O characteristics. I certainly had no intention to imply that they were. Since the question was about "general methods" for optimizing, I tried to stick to general ideas rather than talking about this specific program. But here's how you can apply my advice to this specific program:

  1. See where your bottleneck is. There are a number of ways to profile your code, ranging from high-end, expensive profiling software to really hacky. Chances are, any of these methods will indicate that your program spends the 99% of its time waiting for a response from Google.
  2. Focus on algorithms. Right now your algorithm is (roughly):
    1. Parse the XML
    2. Create a list of words
    3. For each word
      1. Ping Google for a spell check.
    4. Return results

Since most of your time is spent in the "ping Google" phase, an obvious way to fix this would be to avoid doing that step more times than necessary. For example:

  1. Parse the XML
  2. Create a list of words
  3. Send list of words to spelling service.
  4. Parse results from spelling service.
  5. Return results

Of course, in this case, the biggest speed boost would probably be by using spell checker that runs on the same machine, but that isn't always an option. For example, TinyMCE runs as a javascript program within the browser, and it can't afford to download the entire dictionary as part of the web page. So it packages up all the words into a distinct list and performs a single AJAX request to get a list of those words that aren't in the dictionary.

StriplingWarrior
  • 151,543
  • 27
  • 246
  • 315
  • I'd suggest looking at the strikingly obvious first by taking a step back and looking at what your program does. Not all performance bottlenecks are to be found in the code's big-O characteristics. this one, for example. – oxbow_lakes Nov 16 '10 at 22:23
  • 1
    You might think it snide, but a *generic method* for optimizing Java is to sit back and look at what the program does, with the rule of thumb/experience that IO (or lock-contention) are likely to be major culprits. Your advice to focus on algorithms over *minutiae* is a good one though – oxbow_lakes Nov 16 '10 at 23:27
0

These folks are probably right, but a few random pauses will turn *probably" into "definitely, and here's why".

Community
  • 1
  • 1
Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135