
I have several hundred HTML files (Pidgin IM log files) that have exactly the same format:

<html>
    <head><meta ...><title>...</title></head>
    <body>
        <h3>...</h3>
        <font color=...><font ...>(TIME)</font> <b>(NAME):</b></font> (MESSAGE)<br/>
        <font color=...><font ...>(TIME)</font> <b>(NAME):</b></font> (MESSAGE)<br/>
        <font color=...><font ...>(TIME)</font> <b>(NAME):</b></font> (MESSAGE)<br/>
        ...

(no closing body/html tags, it just repeats those lines until EOF)

I need to extract the time, name and messages from these files. I'm not great with regex and the HTML libraries I've tried seem a bit complex for what I'm trying to do. Any suggestions?

Corey
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Luiggi Mendoza May 09 '14 at 02:00
  • Does [this](https://stackoverflow.com/questions/832620/stripping-html-tags-in-java?rq=1) get you a little closer? – saiarcot895 May 09 '14 at 02:00
  • Do the message or name fields contain html? – Zeki May 09 '14 at 02:36

2 Answers


If this is a one-off need and the format really is this regular, I would skip both regex and HTML libraries and walk each line with plain indexOf calls:

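String[] lines = readFile(...);
for (String lin : lines) {
    int str, end;
    // Find the outer <font color=...> tag, then the inner <font ...> that wraps the time.
    // The second search starts at str + 1; starting at str would find the same tag again.
    if ((str = lin.indexOf("<font "))          != -1
     && (str = lin.indexOf("<font ", str + 1)) != -1
     && (str = lin.indexOf(">", str))          != -1
     && (end = lin.indexOf("</font>", str))    != -1) {
        str++;                                  // step past '>'
        String time = lin.substring(str, end);  // still includes the parentheses

        if ((str = lin.indexOf("<b>", end))   != -1
         && (end = lin.indexOf(":</b>", str)) != -1) {
            str += 3;                           // step past "<b>"
            String name = lin.substring(str, end);

            // ... and so on for the message
        }
    }
}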

(note this code is off the cuff, uncompiled and untested, intended to convey the basic idea)
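For reference, here is one way the indexOf idea above could be completed for all three fields. It is a sketch under assumptions, not tested against real Pidgin logs: the class name `PidginLineParser`, the `parse` helper, and the attribute values in the sample line are all made up for illustration, and the message is assumed to run from the closing `</font>` to the last `<br/>` on the line.

```java
import java.util.Arrays;

class PidginLineParser {

    /**
     * Returns {time, name, message} for one log line, or null when the line
     * does not have the expected shape (header lines, blank lines, ...).
     * Hypothetical helper, written for illustration only.
     */
    static String[] parse(String line) {
        // Outer <font color=...> tag, then the inner <font ...> that wraps the time.
        int outer = line.indexOf("<font ");
        if (outer == -1) return null;
        int inner = line.indexOf("<font ", outer + 1);
        if (inner == -1) return null;

        int timeStart = line.indexOf('>', inner);
        if (timeStart == -1) return null;
        timeStart++;                                   // step past '>'
        int timeEnd = line.indexOf("</font>", timeStart);
        if (timeEnd == -1) return null;

        int nameStart = line.indexOf("<b>", timeEnd);
        if (nameStart == -1) return null;
        nameStart += "<b>".length();
        int nameEnd = line.indexOf(":</b>", nameStart);
        if (nameEnd == -1) return null;

        // The message sits between the outer </font> and the trailing <br/>.
        int msgStart = line.indexOf("</font>", nameEnd);
        if (msgStart == -1) return null;
        msgStart += "</font>".length();
        int msgEnd = line.lastIndexOf("<br/>");
        if (msgEnd < msgStart) msgEnd = line.length(); // tolerate a missing <br/>

        String time = line.substring(timeStart, timeEnd).trim();
        if (time.startsWith("(") && time.endsWith(")")) {
            time = time.substring(1, time.length() - 1); // drop the parentheses
        }
        String name = line.substring(nameStart, nameEnd);
        String message = line.substring(msgStart, msgEnd).trim();
        return new String[] { time, name, message };
    }

    public static void main(String[] args) {
        // Sample line with assumed attribute values.
        String sample = "<font color=\"#16569E\"><font size=\"2\">(10:15:30)</font> "
                      + "<b>Corey:</b></font> hello there<br/>";
        System.out.println(Arrays.toString(parse(sample)));
    }
}
```

Returning null for lines that do not fit lets the caller skip the `<html>`, `<head>` and `<h3>` lines without special-casing them.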

Lawrence Dol

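I was able to solve this with regex: strip the font, b and br tags from each message line, then pull the time, name and message out of the remaining text by position.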

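// Message lines contain the outer <font color=...> tag.
Pattern correct = Pattern.compile("<font color=.*?>");
// Matches opening and closing font, b and br tags so they can be stripped.
Pattern replace = Pattern.compile("</?(font|b|br/)( +.*?)?>");

for (String s : Files.readAllLines(myfile)) {
    // find() rather than matches(): matches() only succeeds when the pattern
    // spans the entire line, which fails as soon as the line is indented.
    if (correct.matcher(s).find()) {
        // Leaves "(TIME) NAME: MESSAGE"; trim() drops any leading indentation.
        String text = replace.matcher(s).replaceAll("").trim();

        int close = text.indexOf(')');
        // The first colon after ')' ends the name; colons inside the time are skipped.
        int result = text.indexOf(':', close);
        if (close == -1 || result == -1) continue;   // not a message line

        String time = text.substring(1, close);
        String name = text.substring(close + 1, result).trim();
        String message = text.substring(result + 1).trim();

        // do stuff with time, name and message
    }
}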
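A variation on the above, swapping the strip-then-substring steps for a single pattern with named capture groups. This is only a sketch: the class name `PidginLineRegex`, the `parse` helper, and the sample attribute values are invented here, and the pattern assumes the time is always wrapped in parentheses, as in the format shown in the question.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PidginLineRegex {

    // One pattern for the whole line; each field gets its own named group.
    static final Pattern LINE = Pattern.compile(
          "<font color=[^>]*>\\s*<font [^>]*>\\((?<time>[^)]*)\\)</font>\\s*"
        + "<b>(?<name>.*?):</b>\\s*</font>\\s*"
        + "(?<message>.*?)\\s*(?:<br/>)?\\s*$");

    /** Returns {time, name, message}, or null if the line is not a message line. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        // find() tolerates leading indentation before the first tag.
        if (!m.find()) return null;
        return new String[] { m.group("time"), m.group("name"), m.group("message") };
    }

    public static void main(String[] args) {
        // Sample line with assumed attribute values.
        String sample = "<font color=\"#16569E\"><font size=\"2\">(10:15:30)</font> "
                      + "<b>Corey:</b></font> see: this<br/>";
        System.out.println(Arrays.toString(parse(sample)));
    }
}
```

Because the name group stops at the first `:</b>`, colons in the time or in the message do not need to be counted.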
Corey