import java.io.*;
import java.util.*;

public class lineCount {

    public static void main(String[] args) {
        // Count how many times each line occurs in the file
        Map<String, Integer> countMap = new HashMap<String, Integer>();

        try (BufferedReader br = new BufferedReader(new FileReader(new File("error.txt")))) {

            String data;
            while ((data = br.readLine()) != null) {
                data = data.trim();

                if (countMap.containsKey(data)) {
                    countMap.put(data, countMap.get(data) + 1);
                } else {
                    countMap.put(data, 1);
                }
            }

            countMap.forEach((k, v) -> System.out.println("Error: " + k + " Occurs " + v + " times."));

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

I have a text file as shown below, and I want to count duplicate lines while ignoring the date and time in each line. If a line has no date and time, that is fine; if it does, the date and time should be ignored before counting.

I have done everything else, but I don't know how to ignore the date and time. Can anyone help me?


Text file:

 ERROR  [CompactionExecutor:21454] 2018-10-29 12:02:41,906 NoSpamLogger.java:91 - Maximum memory usage reached (125.000MiB), cannot allocate chunk of 1.000MiB
 ERROR  [CompactionExecutor:21454] 2018-10-29 12:02:41,906 NoSpamLogger.java:91 - Maximum memory usage reached (125.000MiB), cannot allocate chunk of 1.000MiB
 ERROR  [CompactionExecutor:21454] 2018-10-29 12:02:41,906 NoSpamLogger.java:91 - Maximum memory usage reached (125.000MiB), cannot allocate chunk of 1.000MiB
 ERROR  [CompactionExecutor:21454] 2018-10-29 12:02:41,906 NoSpamLogger.java:91 - Maximum memory usage reached (125.000MiB), cannot allocate chunk of 1.000MiB
 ERROR  [CompactionExecutor:21454] 2018-10-29 12:02:41,906 NoSpamLogger.java:91 - Maximum memory usage reached (125.000MiB), cannot allocate chunk of 1.000MiB
 2018-09-20 14:08:14.571 [main] ERROR org.apache.flink.yarn.YarnApplicationMasterRunner  -     -Dlogback.configurationFile=file:logback.xml
 2018-09-20 14:08:14.571 [main] ERROR  org.apache.flink.yarn.YarnApplicationMasterRunner  -     -Dlogback.configurationFile=file:logback.xml
 ERROR  [CompactionExecutor:21454] 2018-10-29 12:02:41,906 NoSpamLogger.java:91 - Maximum memory usage reached (125.000MiB), cannot allocate chunk of 1.000MiB
 ERROR  [CompactionExecutor:21454] 2018-10-29 12:02:41,906 NoSpamLogger.java:91 - Maximum memory usage reached (125.000MiB), cannot allocate chunk of 1.000MiB
 ERROR  [CompactionExecutor:21454] 2018-10-29 12:02:41,906 NoSpamLogger.java:91 - Maximum memory usage reached (125.000MiB), cannot allocate chunk of 1.000MiB
 2018-10-29T12:01:00Z E! Error in plugin [inputs.openldap]: LDAP Result Code 32 "No Such Object": 
 2018-10-29T12:01:00Z E! Error in plugin [inputs.openldap]: LDAP Result Code 32 "No Such Object": 
 2018-10-29T12:01:00Z E! Error in plugin [inputs.openldap]: LDAP Result Code 32 "No Such Object": 
 2018-10-29T12:01:00Z E! Error in plugin [inputs.openldap]: LDAP Result Code 32 "No Such Object": 
 2018-10-29T12:01:00Z E! Error in plugin [inputs.openldap]: LDAP Result Code 32 "No Such Object": 
 2018-10-29T12:01:00Z E! Error in plugin [inputs.openldap]: LDAP Result Code 32 "No Such Object": 
 2018-10-29T12:01:00Z E! Error in plugin [inputs.openldap]: LDAP Result Code 32 "No Such Object": 
 2018-10-29T12:01:00Z E! Error in plugin [inputs.openldap]: LDAP Result Code 32 "No Such Object": 
 ERROR  [CompactionExecutor:21454] 2018-10-29 12:02:41,906 NoSpamLogger.java:91 - Maximum memory usage reached (125.000MiB), cannot allocate chunk of 1.000MiB
 2018-09-20 14:08:14.571 [main] ERROR  org.apache.flink.yarn.YarnApplicationMasterRunner  -     -Dlogback.configurationFile=file:logback.xml
 2018-09-20 14:08:14.571 [main] ERROR  org.apache.flink.yarn.YarnApplicationMasterRunner  -     -Dlogback.configurationFile=file:logback.xml
 2018-09-20 14:08:14.571 [main] ERROR  org.apache.flink.yarn.YarnApplicationMasterRunner  -     -Dlogback.configurationFile=file:logback.xml
 2018-09-20 14:08:14.571 [main] ERROR  org.apache.flink.yarn.YarnApplicationMasterRunner  -     -Dlogback.configurationFile=file:logback.xml
 2018-09-20 14:08:14.571 [main] ERROR  org.apache.flink.yarn.YarnApplicationMasterRunner  -     -Dlogback.configurationFile=file:logback.xml
 "2018-10-16 19:54:26.691 [RawEventProcessor (2/2)] ERROR com.qolsys.iqcloud.processing.operators.RawEventProcessor1  - processRawPanelEvent():: SerialNumber systemSerialNumber: QV01D173700428AD is not mapped to any meid. {}",2
 "2018-10-16 19:54:27.074 [RawEventProcessor (2/2)] ERROR com.qolsys.iqcloud.processing.operators.RawEventProcessor1  - processRawPanelEvent():: SerialNumber systemSerialNumber: QV01D173700428AD is not mapped to any meid. {}",2
 "2018-10-16 19:54:27.293 [RawEventProcessor (2/2)] ERROR com.qolsys.iqcloud.processing.operators.RawEventProcessor1  - processRawPanelEvent():: SerialNumber systemSerialNumber: QV01D173700428AD is not mapped to any meid. {}",2
 "2018-10-16 19:54:27.296 [RawEventProcessor (2/2)] ERROR com.qolsys.iqcloud.processing.operators.RawEventProcessor1  - processRawPanelEvent():: SerialNumber systemSerialNumber: QV01D173700428AD is not mapped to any meid. {}",2
 "2018-10-16 19:54:27.471 [RawEventProcessor (2/2)] ERROR com.qolsys.iqcloud.processing.operators.RawEventProcessor1  - processRawPanelEvent():: SerialNumber systemSerialNumber: QV01D173700428AD is not mapped to any meid. {}",2
 "2018-10-16 19:54:27.570 [RawEventProcessor (2/2)] ERROR com.qolsys.iqcloud.processing.operators.RawEventProcessor1  - processRawPanelEvent():: SerialNumber systemSerialNumber: QV01D173700428AD is not mapped to any meid. {}",2
 "2018-10-16 19:54:27.574 [RawEventProcessor (2/2)] ERROR com.qolsys.iqcloud.processing.operators.RawEventProcessor1  - processRawPanelEvent():: SerialNumber systemSerialNumber: QV01D173700428AD is not mapped to any meid. {}",2
 "2018-10-16 19:54:27.574 [RawEventProcessor (2/2)] ERROR com.qolsys.iqcloud.processing.operators.RawEventProcessor1  - processRawPanelEvent():: SerialNumber systemSerialNumber: QV01D173700428BD is not mapped to any meid. {}",2
 "2018-10-16 19:54:27.574 [RawEventProcessor (2/2)] ERROR com.qolsys.iqcloud.processing.operators.RawEventProcessor1  - processRawPanelEvent():: SerialNumber systemSerialNumber: QV01D173700428BD is not mapped to any meid. {}",2
    If you just want to know how many lines are duplicates, you could use the hashcode of a line and place it in a set, capturing the return value. When you add an object to a collection, generally if it already exists the old value is returned, else null. If the return object is not null, you know that the value already existed and was a duplicate, and you can count it as a duplicate. – adickinson Oct 31 '18 at 12:06
  • 1
    To ignore the date and time, have a look at Regular Expressions (aka "regex") for string searching. That will enable you to recognise and remove the date/time from a line if present. Then process the remaining text normally to recognise duplicates. – rossum Oct 31 '18 at 12:13
  • 1
  • @Andrew Storing hashes is not safe to do. Multiple strings map to the same hash because of the pigeonhole principle. Existing algorithms (*e.g.* hashmap) that use hashes get away with it because it's a *preliminary* check, after which actual objects are compared with `equals()` when the hash matches (see the short example after these comments). – Mark Jeronimus Oct 31 '18 at 12:17
  • @MarkJeronimus Do you have any sources on that which I can read? "Pigeon Hole principle" seems to be too vague to return relevant material. – adickinson Oct 31 '18 at 12:19
  • Try the first search result, which is Wikipedia. For a more in-context explanation, see https://stackoverflow.com/questions/7417668/java-use-hashcode-inside-of-equals-for-convenience – Mark Jeronimus Nov 01 '18 at 11:02
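As a short illustration of the collision point raised above (plain Java; the values can be checked in jshell), two distinct strings can share a hash code:

// "Aa" and "BB" are different strings but have the same hashCode (2112),
// so a matching hash alone cannot prove two lines are duplicates.
System.out.println("Aa".hashCode());   // 2112
System.out.println("BB".hashCode());   // 2112
System.out.println("Aa".equals("BB")); // false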

2 Answers


You just have to use a regex to remove the timestamps. I would also use a stream instead of the loop, so you can easily switch to a parallel stream later. Try this code:

List<String> lines = Files.readAllLines(new File("error.txt").toPath());
String timestampRegex = "\\d{4}-\\d{2}-\\d{2}\\s\\d{2}:\\d{2}:\\d{2}[,.]\\d{3}";
Map<String, Long> map = lines.stream()
        .map(e -> e.replaceAll(timestampRegex, ""))
        .collect(Collectors.groupingBy(e -> e, Collectors.counting()));
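On its own this snippet does not catch the 2018-10-29T12:01:00Z entries from the sample log, and it needs a few imports to run. A self-contained sketch along the same lines (the class name and the widened regex are illustrative, not part of the original answer) could look like this:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DuplicateLineCounter {
    public static void main(String[] args) throws IOException {
        // Covers "2018-10-29 12:02:41,906", "2018-09-20 14:08:14.571" and "2018-10-29T12:01:00Z"
        String timestampRegex = "\\d{4}-\\d{2}-\\d{2}[\\sT]\\d{2}:\\d{2}:\\d{2}(?:[,.]\\d{1,3}|Z)?";

        List<String> lines = Files.readAllLines(Paths.get("error.txt"));
        Map<String, Long> counts = lines.stream()
                .map(line -> line.replaceAll(timestampRegex, "").trim())
                .collect(Collectors.groupingBy(line -> line, Collectors.counting()));

        counts.forEach((line, count) ->
                System.out.println("Error: " + line + " Occurs " + count + " times."));
    }
}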
Michael Dz

In your while loop, replace `data = data.trim();` with `data = extractString(data.trim());`, using the method below:

// Requires java.util.regex.Matcher and java.util.regex.Pattern.
public static String extractString(String input) {
    // Group 2 matches the timestamp; groups 1 and 3 match the text around it.
    String regEx = "(.*)([ ]*\\d{4}-\\d{2}-\\d{2}[\\sT]\\d{2}:\\d{2}:\\d{2}(?:(?:[,.]{1}\\d{0,3})|(?:[Z]{1}))[ ]*)(.*)";
    Matcher matcher = Pattern.compile(regEx).matcher(input);

    String output = "";
    if (matcher.matches()) {
        // Timestamp found: keep only the text before and after it
        output = matcher.group(1) + matcher.group(3);
    } else {
        // No timestamp in this line: keep it as-is
        output = input;
    }

    return output.trim();
}
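Plugged into the question's loop, the counting part would then look roughly like this (Map.merge and the print format are just to keep the sketch short; the original containsKey/put logic works the same way):

while ((data = br.readLine()) != null) {
    // Strip any timestamp before using the line as a map key
    data = extractString(data.trim());
    countMap.merge(data, 1, Integer::sum);
}
countMap.forEach((k, v) -> System.out.println("Occurs " + v + " times :: " + k));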

Output:

Occurs 7 times :: "[RawEventProcessor (2/2)] ERROR com.qolsys.iqcloud.processing.operators.RawEventProcessor1  - processRawPanelEvent():: SerialNumber systemSerialNumber: QV01D173700428AD is not mapped to any meid. {}",2
Occurs 1 times :: [main] ERROR org.apache.flink.yarn.YarnApplicationMasterRunner  -     -Dlogback.configurationFile=file:logback.xml
Occurs 2 times :: "[RawEventProcessor (2/2)] ERROR com.qolsys.iqcloud.processing.operators.RawEventProcessor1  - processRawPanelEvent():: SerialNumber systemSerialNumber: QV01D173700428BD is not mapped to any meid. {}",2
Occurs 8 times :: E! Error in plugin [inputs.openldap]: LDAP Result Code 32 "No Such Object":
Occurs 6 times :: [main] ERROR  org.apache.flink.yarn.YarnApplicationMasterRunner  -     -Dlogback.configurationFile=file:logback.xml
Occurs 9 times :: ERROR  [CompactionExecutor:21454] NoSpamLogger.java:91 - Maximum memory usage reached (125.000MiB), cannot allocate chunk of 1.000MiB

Below is some clarification of the output shown above:

  • The 1st and 3rd lines may look similar, but they differ in the serial number: QV01D173700428AD vs QV01D173700428BD.
  • The 2nd and 5th lines may also look similar, but in the given input there is an extra space after the word ERROR and before org.apache.flink.yarn in one of them:

    [main] ERROR org.apache.flink.yarn

Breaking down the regex

Each log record is split into 3 groups: the 2nd group matches the timestamp, while the 1st and 3rd groups match any characters around it.

In the above error log, there are three kinds of timestamps:

2018-10-29 12:02:41,906

2018-09-20 14:08:14.571

2018-10-29T12:01:00Z

Below is the breakdown of the regex:

Group 1 - matches any characters: (.*)

Group 2 - matches the timestamp: ([ ]*\\d{4}-\\d{2}-\\d{2}[\\sT]\\d{2}:\\d{2}:\\d{2}(?:(?:[,.]{1}\\d{0,3})|(?:[Z]{1}))[ ]*)

Group 3 - matches any characters: (.*)

In the Java code, Group 2 (the timestamp) is discarded and only Groups 1 and 3 are kept.
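As a quick illustration, running extractString on the first line of the sample file yields the text counted in the "Occurs 9 times" output line above:

String line = " ERROR  [CompactionExecutor:21454] 2018-10-29 12:02:41,906 NoSpamLogger.java:91 - Maximum memory usage reached (125.000MiB), cannot allocate chunk of 1.000MiB";
System.out.println(extractString(line));
// ERROR  [CompactionExecutor:21454] NoSpamLogger.java:91 - Maximum memory usage reached (125.000MiB), cannot allocate chunk of 1.000MiB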

Breaking down Group 2:

[ ]* zero or more spaces
\\d{4} 4 digits (the year)
- a hyphen
\\d{2} 2 digits (the month)
- a hyphen
\\d{2} 2 digits (the day)
[\\sT] a whitespace character or the letter 'T'
\\d{2} 2 digits (the hour)
: a colon
\\d{2} 2 digits (the minutes)
: a colon
\\d{2} 2 digits (the seconds)
(?:(?:[,.]{1}\\d{0,3})|(?:[Z]{1})) either ',' or '.' followed by up to 3 digits, or the letter 'Z'
[ ]* zero or more spaces


Here ( ) is a capturing group and (?: ) is a non-capturing group. To learn more about grouping in regex, see https://www.regular-expressions.info/refcapture.html or other resources.
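A tiny illustration of the difference, using a made-up pattern and the same java.util.regex classes as above:

Matcher m = Pattern.compile("(\\d{2}):(?:\\d{2})").matcher("12:34");
if (m.matches()) {
    System.out.println(m.groupCount()); // 1, only the capturing group is counted
    System.out.println(m.group(1));     // "12"; the (?: ) part captured nothing
}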