-2

Using a general-purpose programming language like Java, what is the most efficient way to search through a ~20 page document to replace a set of 5000+ strings with some predetermined replacement string? The program should not replace any strings that have already been replaced. What data structure would be optimal to store the 5000+ strings and each of their replacements - two arrays, a dictionary, or something else?

Here are some of the options that I have considered so far:

  • Iterate through the entire .txt document once time per string using string.replace. The problem is that the algorithm must iterate through the entire .txt document an extra time for each string stored.

  • Iterate through the .txt once while replacing string as necessary while creating a new string by appending replacements. This seems more efficient, but each step would still require checking the entire set of 5000+ strings for any strings to replace.

Is there a more optimized means of solving this problem, or is one of the above attempts already optimal?

Also, would it be possible to run this algorithm more efficiently in a lower-level language like C?

Alekxos
  • 512
  • 6
  • 13
  • 3
    Show us the code you tried. Twenty pages and 5000 strings is not that many. What is wrong with the code you are using now? – markspace Aug 22 '14 at 17:56
  • Use `ProcessBuilder` to have your program call `sed`??? (OK, that's the Old Linux Programmer's solution...) – ajb Aug 22 '14 at 17:57
  • Why do you require two passes to replace a string? – user1231232141214124 Aug 22 '14 at 17:57
  • Maybe you should use shell http://www.grymoire.com/Unix/Sed.html – Diogo Moreira Aug 22 '14 at 17:57
  • open the file in notepad and press ctrl-h – j.con Aug 22 '14 at 17:58
  • If building significantly sized output incrementally, use a StringBuilder, not string concatenation (although sometimes javac helps behind the scenes). There is no need for repeated calls to String.replace (and some subtle issues associated with it). I recommend using a generated regular expression (with replaceAll or a regex-replace-walk), but even a manual word-by-word (with probe into appropriate lookup) loop will be fast. – user2864740 Aug 22 '14 at 17:58
  • 1
    Practically anything will run better using C – user1231232141214124 Aug 22 '14 at 17:58
  • I haven't written any code yet because I am looking for the most efficient algorithm first. I am sure that either of the two approaches that I mentioned would work, but I was wondering if there is an accepted maximally optimized algorithm for this situation. – Alekxos Aug 22 '14 at 17:58
  • public Matcher appendReplacement(StringBuffer sb, String replacement) – user3487063 Aug 22 '14 at 17:58
  • 2
    Well heres a tip, WRITE the algorithm, then you optimize. I still don't know why you would need to passes to replace strings – user1231232141214124 Aug 22 '14 at 17:58
  • 1
    @AlexBoulton: what redFIVE said. Write some code, making obvious optimizations as they occur to you. Ie, a `StringBuilder` is probably best here. Then measure the results. If it's fast enough, you're done. If not, come back here as ask for a better method than the code you have. Be prepared to show us enough of the code so that we understand what you did. – markspace Aug 22 '14 at 18:01
  • The two bullet points don't mean two full iterations - those are just two separate approaches. – Alekxos Aug 22 '14 at 18:01
  • Seriously, I think I'd try joining all the search strings together with `|` to make a regex, then using just one `replaceAll`. This could be faster or much, much slower depending on how regexes are implemented, but it might be worth a try. – ajb Aug 22 '14 at 18:03
  • This is going to run practically instantaneously however you do it, as long as your code isn't grossly inefficient. Buffered input and output is the main concern. – user207421 Aug 22 '14 at 18:18
  • Are you replacing each of 5000 strings with one string - the same string - every time? Or does each of 5000 strings have its own replacement - ie 5000 replacements for 5000 strings? This is not clear to me from the question. – Stewart Aug 23 '14 at 01:58

2 Answers2

2

You want to replace some string in 5000 strings and you want to make it optimal ... Now my question to you is: How will you know if you have to replace a string if you dont read the string? It's not possible, you have to read everything. And the shortest way to do that is to go line by line and replace immediatly. And somebody can correct me if i'm wrong, but reading a file is one of the most basic operations there is so using a library for that besides what is available by default in the programming language seems total overkill to me. Furthermore, every language has basic io and if it doesn't then don't use it.

To store strings, it all depends what you want to do with them. Different data structures have different purposes and some are better suited in some situations then others. If you just need to store them then a simple array is fine. However, if you need more advanced functions then you need to consider your options. But again it's all up to what you want to do with them later. And there is the memory issue, you need to calculate how much memory your 5000+ strings will take, because you might run out of memory. Then you need to think if it's worth it to use all that memory. check this link

Finally your question about C, ofcourse it will be more efficient. Java runs in a virtual machine that adds considerable overhead. So basically your Java program runs in another Java program and if you know that there is a cost for every single operation then you understand that C will be more efficient then Java in terms of performance.

Community
  • 1
  • 1
fonZ
  • 2,428
  • 4
  • 21
  • 40
  • And herein lies the problem of not actually writing any code before asking a question on SO – user1231232141214124 Aug 22 '14 at 18:19
  • Just put 5000 strings in `hashmap` and you're golden – user1231232141214124 Aug 22 '14 at 18:30
  • @redFIVE Im just saying he should consider size and memory availability. Furthermore, some data structures have different complexity which might make them more or less efficient with more or less data. – fonZ Aug 22 '14 at 18:34
  • I think you will be hard pressed to find a more efficient method of finding a string out of 5000 possible unsorted entries than sticking it all in a hashmap – user1231232141214124 Aug 22 '14 at 18:36
  • I think the question remains, what are you going to do with it? I agree that a (hash)map is a plausible solution. But what if he only replaces 2 strings? Then you will store 5000+ strings? Then find the replacements? – fonZ Aug 22 '14 at 18:39
  • 1
    @redFIVE only if the _entire_ input string is supposed to match one of the 5000 possibilities. If he's looking for cases where substrings of the input string match one of the possible search strings (which is how I read the question), a hashmap is not useful. – ajb Aug 22 '14 at 18:40
0

I would use the commons-lang library, which I think has exactly what you are looking for. Basically you create one array with all the strings you want to substitute and another array with the substitutions. See http://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html for details on the StringUtils#replaceEach method.

jjathman
  • 12,536
  • 8
  • 29
  • 33