1

I have to find the most frequently occurring IP addresses in Apache logs.

12.1.12.1 9000 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

12.1.12.1 9000 192.145.1.23 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

How do I extract the IP addresses (i.e. the 3rd word in each line) using regular expressions in Java? I also have to find the most common IP addresses among them, in order to detect robotic access. The log contains millions of lines, so a regexp may be suitable for this.

Anand
  • Why bother with a regex? Just take the substring between the 2nd and 3rd spaces. – Dan Grossman Feb 09 '11 at 08:04
  • I have to take it from millions of lines. It will become slow. – Anand Feb 09 '11 at 09:57
  • No Anand, if you take it from millions of lines it will be fast, because regular expressions have more overhead than simply finding the index of the 2nd and 3rd space, then directly accessing the substring. – Michael Dillon Feb 19 '11 at 07:15
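The substring approach from the comments above could look roughly like the sketch below. It assumes fields are separated by exactly one space; the class and method names (`ThirdField`, `extract`) are made up for illustration.

```java
final class ThirdField {
    // Returns the text between the 2nd and 3rd space,
    // or null if the line contains fewer than three spaces.
    // Assumption: fields are separated by exactly one space.
    static String extract(String line) {
        int first = line.indexOf(' ');
        if (first < 0) return null;
        int second = line.indexOf(' ', first + 1);
        if (second < 0) return null;
        int third = line.indexOf(' ', second + 1);
        if (third < 0) return null;
        return line.substring(second + 1, third);
    }

    public static void main(String[] args) {
        String line = "12.1.12.1 9000 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700]";
        System.out.println(extract(line)); // prints 127.0.0.1
    }
}
```

No regex engine is involved here: the line is scanned only up to the third space and a single substring is created.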

4 Answers

3

If you are certain that it is always the 3rd word (as you said), maybe you don't need regular expressions at all. You could just take the third word via a simple split.

However, someone has already asked that: Regular expression to match DNS hostname or IP Address?...
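A minimal sketch of the "simple split" idea, assuming single-space separators. Note that `String.split` takes a regular expression as its delimiter argument; the limit of 4 just avoids splitting the rest of the line. The names `SplitThirdWord` and `thirdWord` are hypothetical.

```java
final class SplitThirdWord {
    // Returns the third space-separated word, or null if there are fewer than three.
    static String thirdWord(String line) {
        // A limit of 4 stops splitting after the third word;
        // the remainder of the line stays unsplit in parts[3].
        String[] parts = line.split(" ", 4);
        return parts.length >= 3 ? parts[2] : null;
    }

    public static void main(String[] args) {
        String line = "12.1.12.1 9000 192.145.1.23 - frank [10/Oct/2000:13:55:36 -0700]";
        System.out.println(thirdWord(line)); // prints 192.145.1.23
    }
}
```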

Community
chahuistle
3

As others have pointed out, you don't need regexes. You shouldn't use String.split either, since it uses regexes as well. You could use StringTokenizer instead. Assuming you use BufferedReader br to read in each line:

String line = br.readLine();
StringTokenizer st = new StringTokenizer(line, " ");
st.nextToken();              // skip the 1st field
st.nextToken();              // skip the 2nd field
String ip = st.nextToken();  // the 3rd field is the IP address
Lucas Zamboulis
  • Apparently can't comment on top post so I'll comment here. To find the most common IP you need to maintain the count of each IP somewhere, either in a hashmap, or (since this may be too much for memory) on disk. I don't see how regexes will make it faster and less memory intensive to find the most common IP. – Lucas Zamboulis Feb 09 '11 at 10:04
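The in-memory counting described in the comment above might be sketched like this. It assumes Java 8 or later (for `Map.merge` and `BufferedReader.lines`), assumes the set of distinct IPs fits in memory, and takes the IP to be the third space-separated field. `IpCounter`, `count` and `top` are names invented for this example.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IpCounter {
    // Counts how often each third space-separated field occurs across all lines.
    static Map<String, Integer> count(BufferedReader br) {
        Map<String, Integer> counts = new HashMap<>();
        br.lines().forEach(line -> {
            String[] parts = line.split(" ", 4);
            if (parts.length < 3) return;            // skip malformed lines
            counts.merge(parts[2], 1, Integer::sum); // increment, starting at 1
        });
        return counts;
    }

    // Returns the n entries with the highest counts, most frequent first.
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Shortened sample lines; a real run would wrap a FileReader instead.
        String log = "12.1.12.1 9000 127.0.0.1 - frank\n"
                   + "12.1.12.1 9000 192.145.1.23 - frank\n"
                   + "12.1.12.1 9000 127.0.0.1 - frank\n";
        Map<String, Integer> counts = count(new BufferedReader(new StringReader(log)));
        System.out.println(top(counts, 1)); // prints [127.0.0.1=2]
    }
}
```

Sorting all entries is fine for a sketch; if the number of distinct IPs is very large, a bounded priority queue or an on-disk count would be the next thing to try.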
1

Here is one solution:

String str1 = "12.1.12.1 9000 127.0.0.1 - frank [10/Oct/2000:13:55:36"
            + " -0700] \"GET /apache_pb.gif HTTP/1.0\" 200 2326 "
            + "\"http://www.example.com/start.html\" \"Mozilla/4.08 "
            + "[en] (Win98; I ;Nav)\"";

String str2 = "12.1.12.1 9000 192.145.1.23 - frank [10/Oct/2000:13:55"
            + ":36 -0700] \"GET /apache_pb.gif HTTP/1.0\" 200 2326 "
            + "\"http://www.example.com/start.html\" \"Mozilla/4.08 "
            + "[en] (Win98; I ;Nav)\"";

Pattern p = Pattern.compile("\\S+\\s+\\S+\\s+(\\S+).*");

Matcher m = p.matcher(str1);
if (m.matches())
    System.out.println(m.group(1));

m = p.matcher(str2);
if (m.matches())
    System.out.println(m.group(1));

Reg-ex breakdown:

  • \S+, one or more non-white space characters.
  • \s+, one or more white space characters.
  • ...
  • (\S+), one or more non-white space characters, captured in group 1.
aioobe
0

The format of the access log file always depends on the configuration file settings. Instead of assuming that the IP address is the third 'word', it would probably be better to read the current configuration file and parse the access log file according to the LogFormat entry.

Apache httpd is configured through httpd.conf and Tomcat through server.xml. Because server.xml is an XML file, parsing the AccessLogValve entry is a standard procedure.

This is a little more work, but it will make your application more flexible, in case it has to stay in use over time. For this approach, I think string methods will be easier to use than regular expressions.
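As a rough illustration of the idea, the sketch below looks up the position of the `%h` directive (remote host in Apache's LogFormat syntax) in a format string and uses that position to pick the field from a log line. The format string shown is a made-up example, not the asker's actual configuration, and the approach only holds if every directive before `%h` expands to a value without spaces. `LogFormatField` and `indexOfDirective` are hypothetical names.

```java
final class LogFormatField {
    // Returns the zero-based position of a directive such as "%h"
    // in a space-separated LogFormat string, or -1 if it is absent.
    static int indexOfDirective(String logFormat, String directive) {
        String[] tokens = logFormat.trim().split(" ");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(directive)) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical LogFormat value, assumed to be already read from httpd.conf.
        String format = "%A %p %h %l %u %t \"%r\" %>s %b";
        int pos = indexOfDirective(format, "%h");
        String line = "12.1.12.1 9000 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700]";
        System.out.println(line.split(" ")[pos]); // prints 127.0.0.1
    }
}
```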

Costis Aivalis