How to remove similar named strings in a list?

Question

Given a list/array of strings:

document
document (1)
document (2)
document (3)
mypdf (1)
mypdf
myspreadsheet (1)
myspreadsheet
myspreadsheet (2)

How do I remove all the duplicates but retain only the highest copy number?

Ending result to be:

document (3)
mypdf (1)
myspreadsheet (2)

There are many approaches to this problem. Any code you've tried? A stack trace to check for errors? — Alfabravo, Nov 17 '16 at 14:06
Will all of your `String`s have this format? Could there be spaces in the name, something like `my file` and `my file (23)`? — bradimus, Nov 17 '16 at 14:16
For the given data set, it seems like the contains method of String could be an easy way of doing this. [oracles docs](http://docs.oracle.com/javase/8/docs/api/java/lang/String.html) — Chains, Nov 17 '16 at 14:18

GhostCat · Accepted Answer · 2016-11-17T14:24:31.040

3

You put in a broad question, so here comes an unspecific (but nonetheless) "complete" answer:

Iterate over all your strings to identify all lines that contain braces.
In other words: identify all the strings that look like "X (n)"
Then, for each "different" X that you found, you can iterate the list again; so that you can find all occurrences of "X", X (1)", .. and so on
Doing so will allow you to detect the maximum n for each of your Xes.
Push that "maximum" "X (n)" into your results list.

In other words: it only takes such a simple receipt to solve this problem; now it only takes your time to turn these pseudo-code instructions into real code.

For the record: if the layout of your file is really as shown above, then things become a bit easier - as it seems that your numbers are just increasing. What I mean is:

X (1)
X (2)
X (3)

is easier to treat than

X (1)
X (3)
X (2)

As in your case, it seems save to assume that the last X(n) contains the largest n. Which makes using a HashMap (as suggested by cainiaofei) a nice solution.

edited Nov 17 '16 at 14:24

answered Nov 17 '16 at 14:16

GhostCat

137,827
25
176
248

appreciated for the logic assistance. this is what I needed. – Zachary Loughridge Nov 17 '16 at 14:21
You welcome ;-) – GhostCat Nov 17 '16 at 14:22
And please see my updates. – GhostCat Nov 17 '16 at 14:24
I agree in the increasing numbers, however they are in no particular order. Although I bet I can order them prior to your solution – Zachary Loughridge Nov 17 '16 at 14:29
I was able to sort the list descending :thumbsup: – Zachary Loughridge Nov 17 '16 at 14:45
This solution worked great!! – Zachary Loughridge Nov 17 '16 at 19:42
@ZacharyLoughridge BE careful about that sorting there - to avoid running into situations where "X (15)" is put in front of "X (5)" for example! You know strings and numbers "sort" in different ways! – GhostCat Nov 17 '16 at 19:44
And of course: you are very welcome. I really like the fact that you didn't expect people throwing code at you, and that "this is how you get started" worked for you. – GhostCat Nov 17 '16 at 19:45

score 1 · Answer 2 · answered Nov 17 '16 at 14:17

1

an alternative solution

use a HashMap the key is the name (e.g. the name of document document (1) document (2) document (3) are all document)

which can be implement by this code str.substring(0,str.indexOf('(')).trim()

and the value is the times that key present, at last traverse the map get the key that corresponding value is max and the result is key(value-1)

answered Nov 17 '16 at 14:17

nail fei

2,179
3
16
36

Your solution simply pushes the **last** number found into the map. Which works if the format is really like the above ... but what if not?! – GhostCat Nov 17 '16 at 14:20
appreciated! just looking for a logical approach – Zachary Loughridge Nov 17 '16 at 14:20
@GhostCat Sorry, I am afraid that I dont understand what you mean. AS for the example of OP, my map will be <[document,4],[mypdf,2],[myspreadsheet,3]> – nail fei Nov 17 '16 at 14:33
@cainiaofei I mean: when you just read the file in sequence, and push document name and counter into the map ... that only works if the "counter" values are always **increasing**. – GhostCat Nov 17 '16 at 15:28
@GhostCat it has nothing with whether 'counter values' are increasing. When put the String into map, I only care the subString before *(counter)* I called it 'name' ,[*e.g. the name of a(1) and a is the same a*].So the 'order' *a(2) a a(1)* and *a a(1) a(2)* is same to me. Reading every string my operation is 'map.put(name,map.get(name+1))' – nail fei Nov 18 '16 at 01:06

score 0 · Answer 3 · answered Nov 17 '16 at 14:28

I would advice you to use a dictionnary :

Map<String, Integer> dict = new HashMap<>();
for (String s : listOfInput){
    String name = s.split(" ")[0];
    String version = s.split(" ")[1].charAt(1);
    if(dict.get(name)!=null){
        if (Integer.parseInt(version) < dict.get(name)){
            continue;
        }
    }
    dict.put(name, version); 
}

The data would be at the end in the dictionary:

key | value

document | 3

mypdf | 1

myspreadsheet | 2

Arnaud · Answer 4 · 2016-11-17T14:37:03.060

Here is a possible approach, but this will only work if the version number doesn't exceed 9 (*) :

1) Sort the list in reverse order, so that the most recent version appears first

(*) The sorting being based on alphabetical order , you should be quite fine unless your version number exceeds one digit.Because 10 for instance, appears before 9 with an alphabetical sorting.

Your list will turn into :

myspreadsheet (2)
myspreadsheet (1)
myspreadsheet
mypdf (1)
mypdf
document (3)
document (2)
document (1)
document

2) Iterate on the list, and only keep the first occurence of a given document (i.e the most recent thanks to the reverse sorting)

3) If you want to, sort back the remaining list to a more natural ordering

    List<String> documents = new ArrayList<String>();

    documents.add("document");
    documents.add("document (1)");
    documents.add("document (2)");
    documents.add("document (3)");
    documents.add("mypdf (1)");
    documents.add("mypdf");
    documents.add("myspreadsheet (1)");
    documents.add("myspreadsheet");
    documents.add("myspreadsheet (2)");

    // 1) Sort in reverse order, so that the most recent document version appears first
    Collections.sort(documents, Collections.reverseOrder());

    String lastDocumentName = "";

    ListIterator<String> iter = documents.listIterator();

    // 2)
    while (iter.hasNext()) {

        String document = iter.next();

        // Store the first part of the String , i.e the document name (without version)
        String firstPart = document.split("\\s+")[0];

        // Check if this document is a version of the last checked document
        // If it is the case, this version is anterior, remove it from the list
        if (lastDocumentName.equals(firstPart)) {

            iter.remove();

        }

        // Store this document's name as the last one checked
        lastDocumentName = firstPart;

    }

    // 3) Sort back to natural order
    Collections.sort(documents);

    for (String doc : documents) {

        System.out.println(doc);
    }

score 0 · Answer 5 · answered Nov 17 '16 at 14:36

This is a simple solution by making use of a Map. First you loop through your list, you split the String, and you add it to the map with the name as the key, and what's inside the paranthesis as a value. And for each entry you check if the key already exists. And if the key exists you compare the value and you add the next entry to the map if the value is bigger than what's already stored. At the end you loop through the map and get your list.

This should probably work with any kind of input. I think...

Of course this can be done better than this. If anybody has suggestions, please feel free to share them.

public static void main(String[] args) {
    List<String> list = Arrays.asList("document", "document (1)", "document (2)", "document (3)", "mypdf (1)", "mypdf", "myspreadsheet (1)",
            "myspreadsheet", "myspreadsheet (2)");

    Map<String, Integer> counterMap = new HashMap<>();
    List<String> newList = new ArrayList<>();

    for (String item : list) {
        if (item.indexOf(')') != -1) {
            String namePart = item.substring(0, item.indexOf('(')).trim();
            Integer numberPart = Integer.parseInt(item.substring(item.indexOf('(') + 1, item.indexOf(')')));

            Integer existingValue = counterMap.get(namePart);
            if (existingValue != null) {
                if (numberPart > existingValue) {
                    counterMap.put(namePart, numberPart);
                }
            } else {
                counterMap.put(namePart, numberPart);
            }
        } else {
            newList.add(item);
        }

    }

    Iterator<Entry<String, Integer>> iterator = counterMap.entrySet().iterator();
    while (iterator.hasNext()) {
        Entry<String, Integer> next = iterator.next();
        String key = next.getKey();
        Integer value = next.getValue();
        if (newList.contains(key)) {
            newList.remove(key);
        }

        newList.add(key + " (" + value + ")");
    }

    System.out.println(newList);

}

Grzegorz Piwowarek · Answer 6 · 2016-11-17T15:10:14.813

Let's utilize the Stream API to group our documents and simply pick the newest revision by sorting Strings by the revision number. Keep in mind that those static methods were implemented poorly because you did not give us too much information about the naming strategy but the idea should be clear.

Algorithm:

Group revisions of the same String together
Pick the number with the highest version from each group

Solution:

    Map<String, List<String>> grouped = input.stream()
      .collect(Collectors.groupingBy(preprocessedString(), Collectors.toList()));

    List<String> finalResult = grouped.entrySet().stream()
      .map(e -> e.getValue().stream()
        .max(Comparator.comparing(revisionNumber())).get()) //at this point we have at least one element
      .collect(Collectors.toList());


}

Helper parsing functions:

private static Function<String, Integer> revisionNumber() {
    return s -> s.contains("(") ? Integer.valueOf(s.substring(s.indexOf('(') + 1, s.indexOf(')'))) : 0;
}

private static Function<String, String> preprocessedString() {
    return s -> s.contains("(") ? s.substring(0, s.lastIndexOf("(")).trim() : s.trim();
}

Input:

List<String> input = Arrays.asList(
      "document",
      "document (1)",
      "document (2)",
      "document (3)",
      "mypdf (1)",
      "mypdf",
      "myspreadsheet (12)",
      "myspreadsheet",
      "myspreadsheet (2)",
      "single");

Result: [single, myspreadsheet (12), document (3), mypdf (1)]

score 0 · Answer 7 · answered Nov 17 '16 at 15:37

We do not need actually to know if the element contains more then one whitespace or whatever. We can aways start from the end and check if the elements is duplicate or not (see if there is a ")" or not).

Also interating once through the List is enough to get all the information we need. Assuming that, I am providing a solution which saves the highest appearance value as a VALUE in a Map which map will have as KEYs all elements in the given input list.

After that you can create your result List with one more iteration through the Map.

public List<String> removeDuplicates(List<String> inputArray) {                                                               
    Map<String, Integer> map = new HashMap<String, Integer>();                                                                
    List<String> result = new ArrayList<String>();                                                                            

    int numberOfOcurences = 0;                                                                                                
    for (int i = 0; i < inputArray.size(); i++) {                                                                             
        String element = inputArray.get(i);                                                                                   
        if (element.charAt(element.length() - 1) == ')') {                                                                    
            numberOfOcurences = Character.getNumericValue(element.charAt(element.length() - 2));                              
            element = element.substring(0, element.length() - 4);                                                             
        } else {                                                                                                              
            numberOfOcurences = 0;                                                                                            
        }                                                                                                                     
        if (map.isEmpty()) {                                                                                                  
            map.put(element, numberOfOcurences);                                                                              
        } else {                                                                                                              
            if (null != map.get(element) && map.get(element) < numberOfOcurences) {                                           
                map.put(element, numberOfOcurences);                                                                          
            } else if (null == map.get(element)) {                                                                            
                map.put(element, numberOfOcurences);                                                                          
            }                                                                                                                 
        }                                                                                                                     
    }                                                                                                                         
    for (String a : map.keySet()) {                                                                                           
        result.add(a + " (" + map.get(a)+ ")");                                                                               
    }                                                                                                                         
    return result;                                                                                                            
}

score -5 · Answer 8 · edited Nov 17 '16 at 21:09

-5

Set<T> mySet = new HashSet<T>(Arrays.asList(Your));

I have found that from another user of stackoverflow, try if it works. Good Luck :)

edited Nov 17 '16 at 21:09

Soatl

10,224
28
95
153

answered Nov 17 '16 at 14:12

Alessandro Pierro

11
6

1

Three answers so far, all with the same wrong assumptions. Note that `document` and `document (1)` are conceptual duplicates, not literal duplicates. There needs to be some processing and accounting. – bradimus Nov 17 '16 at 14:14

How to remove similar named strings in a list?

8 Answers8