I have two lists like the ones below.

List<String> list1 = Arrays.asList("I'm a cat", "dog", "There's an elephant and I'm seeing", "we're five");

List<String> list2 = Arrays.asList("I'm", "There's", "we're");

I also have a hash map like this:

"I'm": "I am"
"we're": "we are"
"There's": "there is"

I need to update list1 using the map values, i.e. the result should be:

"I am a cat", "dog", "There is an elephant and I am seeing it", "we are five"

My main problem is that list1 has close to 80K sentences and the map has about 4K entries. I can generate list1, list2 and the map, but because the data is so large I can't find an efficient way to do the find and replace.

I thought of using Commons StringUtils.replaceAll() after converting my lists into arrays, but then I would need to loop through all 80K items * 4K times (even more, since they are sentences rather than single-word strings).

How can I do it?

halfer
Rakesh
  • What about parallelStream in Java 8? – Youcef LAIDANI May 13 '18 at 12:38
  • ...but the question is, what is the source of the data in the list? Did you type it in the code, or do you fill it from a file or a database? – Youcef LAIDANI May 13 '18 at 12:42
  • If the query patterns stay the same and the texts differ, it makes sense to construct an FSM based on the query strings (in your case, the set of map keys). That will optimize the pattern search, but you will still have to process all 80K entries one by one – mangusta May 13 '18 at 12:48
  • Can't you put the list of strings into a single string variable with some `delimiter` and apply `StringUtils.replaceAll()`? At the end you split the string back into a string array on the delimiter. That way you only need to loop through the `Map` you have. – Abid Khan May 13 '18 at 12:57
  • Hi all, apologies for the delayed response. My data is in an Excel file, and I'm using POI to build the lists and the map – Rakesh May 13 '18 at 13:44
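
The 80K * 4K concern raised in the question can be avoided entirely in one special case. The following is only a sketch, assuming every map key is a single token separated from its neighbours by single spaces (true for the sample data, but an assumption for the real data). The class name `TokenLookupSubstituter` and the method `replaceTokens` are hypothetical. Each word costs one hash lookup, so the work grows with the total number of words rather than with sentences times keys.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TokenLookupSubstituter {

    // Splits on single spaces and swaps each token that is a key in the map.
    // The -1 limit keeps empty tokens, so the original spacing is preserved.
    public static String replaceTokens(String sentence, Map<String, String> map) {
        String[] tokens = sentence.split(" ", -1);
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = map.getOrDefault(tokens[i], tokens[i]);
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("I'm", "I am");
        map.put("we're", "we are");
        map.put("There's", "there is");

        List<String> list1 = Arrays.asList(
                "I'm a cat", "dog", "There's an elephant and I'm seeing", "we're five");

        List<String> updated = list1.stream()
                .map(s -> replaceTokens(s, map))
                .collect(Collectors.toList());
        System.out.println(updated);
    }
}
```

A token with punctuation attached (for example `I'm,`) would not match here; in that case a regex-based approach is the safer choice.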

3 Answers

You can perform the substitutions in a single pass. Arrange for the text to be stored as a single string so that you can operate on the input in bulk. You can use an appropriate delimiter so that you can separate the strings when the translation is done.

Prepare a regular expression (or generate a state machine based tokenizer using a tool like JFlex) that matches any of the strings to be replaced (the keys in your map). Then iterate over each match and perform the substitution.

Here's an example of using Pattern to perform the replacements in bulk:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.stream.Collectors;

public class Substituter {
    public static void main(String args[]) {
        // Read the input into a string (or combine the inputs if needed)

        List<String> strings = Arrays.asList("I'm a cat", "dog", "There's an elephant and I'm seeing", "we're five");

        // String replacements

        Map<String, String> replacements = new HashMap<>();
        replacements.put("I'm", "I am");
        replacements.put("we're", "we are");
        replacements.put("There's", "there is");

        // Build the regular expression by concatenating the strings to be replaced into an or expression (|)

        Pattern pattern = Pattern.compile(replacements.keySet().stream().map(Pattern::quote).collect(Collectors.joining("|")));

        // Perform the substitutions

        Matcher m = pattern.matcher(String.join("~", strings));
        StringBuffer newText = new StringBuffer();

        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in a replacement value literal
            m.appendReplacement(newText, Matcher.quoteReplacement(replacements.get(m.group())));
        }

        m.appendTail(newText);

        // Split the output into separate strings if needed

        List<String> newStrings = Arrays.asList(newText.toString().split("~"));
        System.out.println("Original strings: " + strings);
        System.out.println("New strings: " + newStrings);
    }
}

Output:

Original strings: [I'm a cat, dog, There's an elephant and I'm seeing, we're five]
New strings: [I am a cat, dog, there is an elephant and I am seeing, we are five]
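
A variant sketch, not part of the program above: the same single-pattern idea applied to each string separately, which avoids having to pick a delimiter that never occurs in the data. The class name `PerStringSubstituter` and its method names are hypothetical. Keys are sorted longest-first so that a longer key is tried before a shorter key that is its prefix, which is an assumption about the desired behaviour when keys overlap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class PerStringSubstituter {

    // Builds one alternation pattern from the map keys, longest key first.
    public static Pattern buildPattern(Map<String, String> replacements) {
        String regex = replacements.keySet().stream()
                .sorted(Comparator.<String>comparingInt(String::length).reversed())
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(regex);
    }

    // Replaces every key occurrence in one string, scanning it once.
    public static String replaceOne(String text, Pattern pattern, Map<String, String> replacements) {
        Matcher m = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in a value literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacements.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Compiles the pattern once and reuses it for every string in the list.
    public static List<String> replaceAll(List<String> strings, Map<String, String> replacements) {
        if (replacements.isEmpty()) {
            return new ArrayList<>(strings); // nothing to replace
        }
        Pattern pattern = buildPattern(replacements);
        List<String> result = new ArrayList<>(strings.size());
        for (String s : strings) {
            result.add(replaceOne(s, pattern, replacements));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> replacements = new HashMap<>();
        replacements.put("I'm", "I am");
        replacements.put("we're", "we are");
        replacements.put("There's", "there is");

        List<String> strings = Arrays.asList(
                "I'm a cat", "dog", "There's an elephant and I'm seeing", "we're five");
        System.out.println(replaceAll(strings, replacements));
    }
}
```

Compiling the pattern once and reusing it matters here, since compilation over a few thousand keys is the expensive step and each sentence is then scanned only once.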
jspcal

Here is another version. I found this post and modified the program a little bit:

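import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// StringUtils comes from Apache Commons Lang

Map<String, String> tokenMap = new HashMap<>();
tokenMap.put("I'm", "I am");
tokenMap.put("We're", "We are");

String[] array = {"I'm at home", "We're playing football"};

// Join the elements with ", " (same text as Arrays.toString(array) without the brackets)
String content = String.join(", ", array);
// Keys are joined unquoted, so they must not contain regex metacharacters
String regex = StringUtils.join(tokenMap.keySet(), "|");
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(content);

StringBuffer buffer = new StringBuffer();

while (matcher.find()) {
    // quoteReplacement keeps '$' and '\' in a replacement value literal
    matcher.appendReplacement(buffer, Matcher.quoteReplacement(tokenMap.get(matcher.group(0))));
}

matcher.appendTail(buffer);
// Splitting on ", " assumes no element contains that sequence itself
array = buffer.toString().split(", ");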

I don't know how efficient it is; I tested it with only a few elements.

0x1C1B

I would use a parallel stream from Java 8+, combined with Apache Commons Lang, which provides the useful method replaceEach(String text, String[] searchList, String[] replacementList):

List<String> list = ...
Map<String, String> mapReplacement = ...
// replaceEach takes the text, a String array of search words and a String array of replacements.
// keySet() and values() iterate in the same order as long as the map is not modified in between.
String[] keys = mapReplacement.keySet().toArray(new String[0]);
String[] values = mapReplacement.values().toArray(new String[0]);

list = list.parallelStream()
        .map(element -> StringUtils.replaceEach(element, keys, values))
        .collect(Collectors.toList());
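
For readers without Commons Lang on the classpath, here is a rough JDK-only sketch of the same parallel idea. It swaps `replaceEach` for a single compiled `Pattern`, so it is a different technique from the snippet above, and it assumes Java 9+ for `Matcher.replaceAll(Function)`. The class name `ParallelSubstituter` and the method `replaceInParallel` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ParallelSubstituter {

    public static List<String> replaceInParallel(List<String> list, Map<String, String> mapReplacement) {
        if (mapReplacement.isEmpty()) {
            return new ArrayList<>(list); // nothing to replace
        }
        // A compiled Pattern is immutable and can be shared between threads;
        // each element gets its own Matcher, which is not thread-safe.
        Pattern pattern = Pattern.compile(mapReplacement.keySet().stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|")));

        // The map is only read here, never modified, while the stream runs.
        return list.parallelStream()
                .map(element -> pattern.matcher(element).replaceAll(
                        match -> Matcher.quoteReplacement(mapReplacement.get(match.group()))))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> mapReplacement = new HashMap<>();
        mapReplacement.put("I'm", "I am");
        mapReplacement.put("we're", "we are");
        mapReplacement.put("There's", "there is");

        List<String> list = Arrays.asList(
                "I'm a cat", "dog", "There's an elephant and I'm seeing", "we're five");
        System.out.println(replaceInParallel(list, mapReplacement));
    }
}
```

Because the source list is ordered, `Collectors.toList()` returns the results in the original order even though the elements are processed in parallel.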

Note

It is still unclear where you get this data from. If it comes from a database, it is better to solve this in the database rather than in Java code. Personally, I don't like holding this much data in a list and a map.

Youcef LAIDANI