1

Given a list/array of strings:

document
document (1)
document (2)
document (3)
mypdf (1)
mypdf
myspreadsheet (1)
myspreadsheet
myspreadsheet (2)

How do I remove all the duplicates but retain only the highest copy number?

Ending result to be:

document (3)
mypdf (1)
myspreadsheet (2)
Grzegorz Piwowarek
  • 13,172
  • 8
  • 62
  • 93

8 Answers8

3

You put in a broad question, so here comes an unspecific (but nonetheless) "complete" answer:

  1. Iterate over all your strings to identify all lines that contain braces.
  2. In other words: identify all the strings that look like "X (n)"
  3. Then, for each "different" X that you found, you can iterate the list again; so that you can find all occurrences of "X", X (1)", .. and so on
  4. Doing so will allow you to detect the maximum n for each of your Xes.
  5. Push that "maximum" "X (n)" into your results list.

In other words: it only takes such a simple receipt to solve this problem; now it only takes your time to turn these pseudo-code instructions into real code.

For the record: if the layout of your file is really as shown above, then things become a bit easier - as it seems that your numbers are just increasing. What I mean is:

X (1)
X (2)
X (3)

is easier to treat than

X (1)
X (3)
X (2)

As in your case, it seems save to assume that the last X(n) contains the largest n. Which makes using a HashMap (as suggested by cainiaofei) a nice solution.

GhostCat
  • 137,827
  • 25
  • 176
  • 248
1

an alternative solution


use a HashMap the key is the name (e.g. the name of document document (1) document (2) document (3) are all document)

which can be implement by this code str.substring(0,str.indexOf('(')).trim()

and the value is the times that key present, at last traverse the map get the key that corresponding value is max and the result is key(value-1)

nail fei
  • 2,179
  • 3
  • 16
  • 36
  • Your solution simply pushes the **last** number found into the map. Which works if the format is really like the above ... but what if not?! – GhostCat Nov 17 '16 at 14:20
  • appreciated! just looking for a logical approach – Zachary Loughridge Nov 17 '16 at 14:20
  • @GhostCat Sorry, I am afraid that I dont understand what you mean. AS for the example of OP, my map will be <[document,4],[mypdf,2],[myspreadsheet,3]> – nail fei Nov 17 '16 at 14:33
  • @cainiaofei I mean: when you just read the file in sequence, and push document name and counter into the map ... that only works if the "counter" values are always **increasing**. – GhostCat Nov 17 '16 at 15:28
  • @GhostCat it has nothing with whether 'counter values' are increasing. When put the String into map, I only care the subString before *(counter)* I called it 'name' ,[*e.g. the name of a(1) and a is the same a*].So the 'order' *a(2) a a(1)* and *a a(1) a(2)* is same to me. Reading every string my operation is 'map.put(name,map.get(name+1))' – nail fei Nov 18 '16 at 01:06
0

I would advice you to use a dictionnary :

Map<String, Integer> dict = new HashMap<>();
for (String s : listOfInput){
    String name = s.split(" ")[0];
    String version = s.split(" ")[1].charAt(1);
    if(dict.get(name)!=null){
        if (Integer.parseInt(version) < dict.get(name)){
            continue;
        }
    }
    dict.put(name, version); 
}

The data would be at the end in the dictionary:

key                     | value

document           | 3

mypdf                 | 1

myspreadsheet | 2

L01c
  • 1,033
  • 10
  • 19
0

Here is a possible approach, but this will only work if the version number doesn't exceed 9 (*) :

1) Sort the list in reverse order, so that the most recent version appears first

(*) The sorting being based on alphabetical order , you should be quite fine unless your version number exceeds one digit.Because 10 for instance, appears before 9 with an alphabetical sorting.

Your list will turn into :

myspreadsheet (2)
myspreadsheet (1)
myspreadsheet
mypdf (1)
mypdf
document (3)
document (2)
document (1)
document

2) Iterate on the list, and only keep the first occurence of a given document (i.e the most recent thanks to the reverse sorting)

3) If you want to, sort back the remaining list to a more natural ordering

    List<String> documents = new ArrayList<String>();

    documents.add("document");
    documents.add("document (1)");
    documents.add("document (2)");
    documents.add("document (3)");
    documents.add("mypdf (1)");
    documents.add("mypdf");
    documents.add("myspreadsheet (1)");
    documents.add("myspreadsheet");
    documents.add("myspreadsheet (2)");

    // 1) Sort in reverse order, so that the most recent document version appears first
    Collections.sort(documents, Collections.reverseOrder());

    String lastDocumentName = "";

    ListIterator<String> iter = documents.listIterator();

    // 2)
    while (iter.hasNext()) {

        String document = iter.next();

        // Store the first part of the String , i.e the document name (without version)
        String firstPart = document.split("\\s+")[0];

        // Check if this document is a version of the last checked document
        // If it is the case, this version is anterior, remove it from the list
        if (lastDocumentName.equals(firstPart)) {

            iter.remove();

        }

        // Store this document's name as the last one checked
        lastDocumentName = firstPart;

    }

    // 3) Sort back to natural order
    Collections.sort(documents);

    for (String doc : documents) {

        System.out.println(doc);
    }
Arnaud
  • 17,229
  • 3
  • 31
  • 44
0

This is a simple solution by making use of a Map. First you loop through your list, you split the String, and you add it to the map with the name as the key, and what's inside the paranthesis as a value. And for each entry you check if the key already exists. And if the key exists you compare the value and you add the next entry to the map if the value is bigger than what's already stored. At the end you loop through the map and get your list.

This should probably work with any kind of input. I think...

Of course this can be done better than this. If anybody has suggestions, please feel free to share them.

public static void main(String[] args) {
    List<String> list = Arrays.asList("document", "document (1)", "document (2)", "document (3)", "mypdf (1)", "mypdf", "myspreadsheet (1)",
            "myspreadsheet", "myspreadsheet (2)");

    Map<String, Integer> counterMap = new HashMap<>();
    List<String> newList = new ArrayList<>();

    for (String item : list) {
        if (item.indexOf(')') != -1) {
            String namePart = item.substring(0, item.indexOf('(')).trim();
            Integer numberPart = Integer.parseInt(item.substring(item.indexOf('(') + 1, item.indexOf(')')));

            Integer existingValue = counterMap.get(namePart);
            if (existingValue != null) {
                if (numberPart > existingValue) {
                    counterMap.put(namePart, numberPart);
                }
            } else {
                counterMap.put(namePart, numberPart);
            }
        } else {
            newList.add(item);
        }

    }

    Iterator<Entry<String, Integer>> iterator = counterMap.entrySet().iterator();
    while (iterator.hasNext()) {
        Entry<String, Integer> next = iterator.next();
        String key = next.getKey();
        Integer value = next.getValue();
        if (newList.contains(key)) {
            newList.remove(key);
        }

        newList.add(key + " (" + value + ")");
    }

    System.out.println(newList);

}
Andrei Olar
  • 2,270
  • 1
  • 15
  • 34
0

Let's utilize the Stream API to group our documents and simply pick the newest revision by sorting Strings by the revision number. Keep in mind that those static methods were implemented poorly because you did not give us too much information about the naming strategy but the idea should be clear.

Algorithm:

  1. Group revisions of the same String together
  2. Pick the number with the highest version from each group

Solution:

    Map<String, List<String>> grouped = input.stream()
      .collect(Collectors.groupingBy(preprocessedString(), Collectors.toList()));

    List<String> finalResult = grouped.entrySet().stream()
      .map(e -> e.getValue().stream()
        .max(Comparator.comparing(revisionNumber())).get()) //at this point we have at least one element
      .collect(Collectors.toList());


}

Helper parsing functions:

private static Function<String, Integer> revisionNumber() {
    return s -> s.contains("(") ? Integer.valueOf(s.substring(s.indexOf('(') + 1, s.indexOf(')'))) : 0;
}

private static Function<String, String> preprocessedString() {
    return s -> s.contains("(") ? s.substring(0, s.lastIndexOf("(")).trim() : s.trim();
}

Input:

List<String> input = Arrays.asList(
      "document",
      "document (1)",
      "document (2)",
      "document (3)",
      "mypdf (1)",
      "mypdf",
      "myspreadsheet (12)",
      "myspreadsheet",
      "myspreadsheet (2)",
      "single");

Result: [single, myspreadsheet (12), document (3), mypdf (1)]

Grzegorz Piwowarek
  • 13,172
  • 8
  • 62
  • 93
0

We do not need actually to know if the element contains more then one whitespace or whatever. We can aways start from the end and check if the elements is duplicate or not (see if there is a ")" or not).

Also interating once through the List is enough to get all the information we need. Assuming that, I am providing a solution which saves the highest appearance value as a VALUE in a Map which map will have as KEYs all elements in the given input list.

After that you can create your result List with one more iteration through the Map.

public List<String> removeDuplicates(List<String> inputArray) {                                                               
    Map<String, Integer> map = new HashMap<String, Integer>();                                                                
    List<String> result = new ArrayList<String>();                                                                            

    int numberOfOcurences = 0;                                                                                                
    for (int i = 0; i < inputArray.size(); i++) {                                                                             
        String element = inputArray.get(i);                                                                                   
        if (element.charAt(element.length() - 1) == ')') {                                                                    
            numberOfOcurences = Character.getNumericValue(element.charAt(element.length() - 2));                              
            element = element.substring(0, element.length() - 4);                                                             
        } else {                                                                                                              
            numberOfOcurences = 0;                                                                                            
        }                                                                                                                     
        if (map.isEmpty()) {                                                                                                  
            map.put(element, numberOfOcurences);                                                                              
        } else {                                                                                                              
            if (null != map.get(element) && map.get(element) < numberOfOcurences) {                                           
                map.put(element, numberOfOcurences);                                                                          
            } else if (null == map.get(element)) {                                                                            
                map.put(element, numberOfOcurences);                                                                          
            }                                                                                                                 
        }                                                                                                                     
    }                                                                                                                         
    for (String a : map.keySet()) {                                                                                           
        result.add(a + " (" + map.get(a)+ ")");                                                                               
    }                                                                                                                         
    return result;                                                                                                            
}   
Lazar Lazarov
  • 2,412
  • 4
  • 26
  • 35
-5
Set<T> mySet = new HashSet<T>(Arrays.asList(Your));

I have found that from another user of stackoverflow, try if it works. Good Luck :)

Soatl
  • 10,224
  • 28
  • 95
  • 153
  • 1
    Three answers so far, all with the same wrong assumptions. Note that `document` and `document (1)` are conceptual duplicates, not literal duplicates. There needs to be some processing and accounting. – bradimus Nov 17 '16 at 14:14