22

My company's proprietary software generates a log file that is much easier to use if it is parsed. The log parser we all use was written by another employee as a side project, and it has horrible performance.

These log files can grow to tens of megabytes very quickly, and the parser we currently use has issues if a log file is bigger than 1 megabyte.

So, I want to write a program that can parse this massive amount of text in the shortest amount of time possible. We use Windows exclusively, so running on Windows is a must. Our current implementation runs on a local web server, and I'm convinced that running it as an application would have to be faster.

All suggestions will be helpful. Thanks.

EDIT: My ultimate goal is to parse the text and display it in a much more user-friendly manner, with colors and such. Can you do this with Perl and Python? I know you can do this with Java and C++. It will function like Notepad: you open a log file, but the screen shows the user-friendly format instead of the raw file.

EDIT: I can't pick a single best answer. The best advice was to choose a language that can best display what I'm going for, and then write the parser in that. Also, using ANTLR will probably make this process much easier. I changed the original question, since I guess I didn't ask what I was really looking for. Thanks everyone!

HenryAdamsJr
  • 850
  • 2
  • 9
  • 21
  • 1
    We'd need a bit more information to be able to help you. A log sample would be nice, as well as how you would like it parsed. – Marcos Placona Mar 25 '10 at 21:59
  • As for how I want to parse it, I've basically described that in my edit above. As for the log file itself, I don't need help with the parsing, just in choosing the best tool to do it. – HenryAdamsJr Mar 25 '10 at 22:35
  • 2
    You should probably also pick a language where displaying the text the way you want it is easy. Displaying might be more involved than the parsing itself. – meriton Mar 25 '10 at 22:40
  • 1
    If you're trying to do anything more than just display the text (have buttons/menus/dialogs to customize the parsing, let the user edit the text, whatever), meriton has it exactly right - pick your language/library for writing your display first. You'll be able to get the text parsed no matter what you pick. – Cascabel Mar 25 '10 at 23:44
  • Why would running as an application be faster than running through a web server? – Duncan Mar 26 '10 at 00:32
  • See [best-language-for-string-manipulation](http://stackoverflow.com/questions/635155/best-language-for-string-manipulation) – nawfal Jul 20 '14 at 21:03

12 Answers

17

Hmmm, "go with what you know" was a good answer. Perl was designed for this sort of thing; in my opinion it is well suited to simple parsing, but I'd personally avoid it for complex projects.

If it gets even a little complex, why not use a proper syntax and grammar set-up?

Lex & Yacc (or Flex & Bison) spring to mind, but personally I would always reach for Antlr.

Define various "words" in terms of patterns (syntax) and rules to combine those words (grammar), and Antlr will spit out a program to parse your input. You can have that program in Java, C, C++ and more; since you are worried about parse time, choose a compiled language, of course.

I personally find it tedious to hand-craft parsers, and even more tedious to debug them, but AntlrWorks is a lovely IDE which really makes it a piece of cake ...

In the AntlrWorks window, that bit at the bottom is defining a grammar rule.

If you mess up your grammar rules, you will be informed. This is not the case with hand-crafted parsers, where you just scratch your body part and wonder about the "strange results"...

Check it out. Even if you think your project is trivial now, it may well grow. And if you have any interest in parsing you do owe it to yourself to at least be familiar with lex/yacc, but especially Antlr(Works)
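To make the "words plus rules" idea concrete, here is a minimal hand-written sketch in Python. It is not ANTLR and not ANTLR's output; a generator would produce something far more complete from a grammar file. The token names (`NUMBER`, `WORD`, `EQUALS`, `WS`), the function names and the `key=value` input format are all assumptions made up for illustration.

```python
import re

# "Words" (tokens), each defined by a pattern. The token names are hypothetical.
TOKEN_RE = re.compile(
    r"(?P<NUMBER>\d+)|(?P<WORD>[A-Za-z_]\w*)|(?P<EQUALS>=)|(?P<WS>\s+)"
)


def tokenize(text):
    """Return a list of (kind, value) tokens, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_pairs(text):
    """Grammar rules: pairs := pair* ; pair := WORD EQUALS (WORD | NUMBER)."""
    tokens = tokenize(text)
    result, i = {}, 0
    while i < len(tokens):
        window = tokens[i:i + 3]
        kinds = [kind for kind, _ in window]
        well_formed = (
            len(window) == 3
            and kinds[:2] == ["WORD", "EQUALS"]
            and kinds[2] in ("WORD", "NUMBER")
        )
        if not well_formed:
            raise SyntaxError(f"expected WORD '=' value at token {i}")
        key, value = window[0][1], window[2][1]
        result[key] = int(value) if kinds[2] == "NUMBER" else value
        i += 3
    return result
```

Calling `parse_pairs("retries=3 user=bob")` gives a dictionary with `retries` as an integer and `user` as a string, while input that breaks the rule (such as `"user="`) raises `SyntaxError` instead of quietly producing strange results. Even at this size the bookkeeping is tedious, which is the argument for letting a tool generate it.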

Willi Mentzel
  • 27,862
  • 20
  • 113
  • 121
Mawg says reinstate Monica
  • 38,334
  • 103
  • 306
  • 551
11

You should use the language that YOU know... Unless you have so much time available to complete the project that you can also spend the time learning a new language.

David
  • 121
  • 1
  • 4
  • 2
    This is ALWAYS the correct answer when the question is "What language should I use to do X?" Even if the language isn't great for what you're doing, if you don't know a better one you're better off sticking to what you know for serious projects. – Billy ONeal Mar 25 '10 at 23:09
  • 1
    That's a great suggestion, and if this was needed within a certain timeframe, I would agree, but I was going to use this project as an excuse to learn something new. Reading all of the answers makes it seem like the language isn't going to make this faster or slower to a great extent. I'm currently leaning towards C++ since I know I can create a Windows GUI with it, and I want to add it to my repertoire. – HenryAdamsJr Mar 25 '10 at 23:15
  • 1
    Have you considered some of us know more than one language already and would like a more specific answer? :p – Nick Bull Oct 09 '18 at 16:43
8

I would suggest using Python or Perl. Parsing large text files with regular expressions is really fast.
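As a rough illustration of the regular-expression approach in Python: the sketch below assumes a made-up line format (`date time LEVEL message`), since the real log format was never posted. `LINE_RE`, `LogEntry`, `parse_lines` and `parse_file` are hypothetical names, not part of any library.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Hypothetical line format: "2010-03-25 21:59:03 ERROR Disk quota exceeded"
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    message: str


def parse_lines(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Yield one LogEntry per matching line, skipping lines that do not match."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\r\n"))
        if match:
            yield LogEntry(**match.groupdict())


def parse_file(path: str) -> Iterator[LogEntry]:
    """Stream a file line by line so memory use stays flat for large logs."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        yield from parse_lines(handle)
```

Compiling the pattern once and iterating over the file object, rather than reading the whole file into memory first, is what keeps this workable as logs grow to tens of megabytes.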

compie
  • 10,135
  • 15
  • 54
  • 78
4

Whatever language your coworker used.

(I could tell you that any macro assembler will let you write code that would rip through your data, but seriously, are you going to spend months writing assembly just to save a few seconds of CPU time? Rewriting a program is fun but it's not practical.)

Whip out your profiler, point it at your horribly performing log parser, and fix the performance problems. If it's a common language, there will be people here who can help.

Ken
  • 1,261
  • 2
  • 9
  • 7
  • It wouldn't save a few seconds. If I do it right, it will literally save minutes. With the current implementation, if the file is sufficiently big, it won't return at all. I feel that his implementation is wrong from the ground up, and I don't have access to the source code, anyway. – HenryAdamsJr Mar 26 '10 at 14:48
4

Parse this massive amount of text in the shortest time possible.

Consider the PADS Project from AT&T. It's a special-purpose language, compatible with C, that's designed exactly for high-speed parsing of log files and other ad hoc data formats. There's even a feature where it can try to learn your log format from examples, although I don't know if that has hit production yet. The people behind the project are really smart, and it's had a big impact within the phone company. PADS gives very high performance on data streams that produce gigabytes. Joe Bob says check it out.

If the goal really is "massive text in the shortest time possible", Perl and Python are not the answer. But if you need to whip up something not too slow, and it's OK to take longer, Perl and Python could be OK. Tens of megabytes is not actually that big.

Norman Ramsey
  • 198,648
  • 61
  • 360
  • 533
3

I've used both Python and Perl. Perl is a more natural fit for this but can be hard to maintain. Python will do it just as well and is easier to read. Go for Python.

Jordan Parmer
  • 36,042
  • 30
  • 97
  • 119
  • 4
    But all the $@% are so beautiful! Go for perl! – Cascabel Mar 25 '10 at 22:07
  • 2
    @Jefromi - Ha! There's nothing like coming back to 200 lines of symbol soup months later trying to figure out what the heck you were thinking. =) – Jordan Parmer Mar 25 '10 at 22:30
  • I added some information to the post to clarify how I'm going to be using the parsed text. I want to have a GUI that will display the log, but in a friendly format. I don't think I've ever seen a Windows GUI app written using Perl or Python, but I know very little about them. – HenryAdamsJr Mar 25 '10 at 22:41
  • @j0rd4n: That just means you get to have all the fun twice! – Cascabel Mar 25 '10 at 22:58
  • I've never used it, but the Tk GUI library comes installed with Python on all platforms, including Windows. – ptomato Mar 25 '10 at 23:31
  • Any language can be hard to maintain, and irrespective of scripting language, you'll end up writing obscure regexes. I haven't used Python, but since Perl = Practical Extraction & Reporting Language and it has its roots in text processing, it seems like a natural choice. – Duncan Mar 25 '10 at 23:50
  • @Duncan, then you should try using Python. With Python, you write fewer regexes, because Python's string manipulation capabilities are often more than enough to do the job. – ghostdog74 Mar 25 '10 at 23:55
  • @Duncan - Write Python and you'll never go back. I was a die-hard Perl fan but what ghostdog74 said is indeed true. – Jordan Parmer Mar 26 '10 at 02:55
  • @ghostdog74,@j0rd4a: Perhaps I am too old and set in my ways ... or more accurately a vast bulk of the code I have to maintain is written in Perl and none of it in Python. So there is no compelling reason to learn it as yet. Although, I'm sure it would help understanding some of the later O'Reilly books :) – Duncan Mar 26 '10 at 04:19
  • @Duncan - Totally understand. :) – Jordan Parmer Mar 26 '10 at 10:30
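For what ghostdog74 describes in the comments above, here is a small sketch of parsing with plain string methods and no regexes. The line layout (`date time LEVEL message`) and the `key=value` message style are assumptions for illustration, and `split_entry` and `parse_key_values` are hypothetical helper names.

```python
def split_entry(line: str):
    """Split "DATE TIME LEVEL message" into (timestamp, level, message).

    Returns None when the line has fewer than four whitespace-separated fields.
    """
    parts = line.strip().split(None, 3)  # at most 3 splits, so the message stays whole
    if len(parts) < 4:
        return None
    date, time, level, message = parts
    return f"{date} {time}", level, message


def parse_key_values(message: str) -> dict:
    """Turn "user=bob action=login" into {"user": "bob", "action": "login"}."""
    pairs = {}
    for token in message.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that contain no "="
            pairs[key] = value
    return pairs
```

Whether this reads better than a regex depends on the format; for fixed, whitespace-delimited fields, `split` and `partition` are usually enough.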
2

I believe Perl is considered a good choice for parsing text.

Gratzy
  • 9,164
  • 4
  • 30
  • 45
2

A finished product such as MS LogParser (usage podcast here) may do what you need, and it's free.

Lucero
  • 59,176
  • 9
  • 122
  • 152
  • 1
    I definitely would recommend looking at existing free or commercial products to solve the problem, no need to reinvent the wheel. Splunk is a popular log parsing and analysis tool that can accept arbitrary input: http://www.splunk.com/base/Documentation/latest/Admin/WhatSplunkCanMonitor – Greg Bray Mar 25 '10 at 22:26
1

Perl is good for text processing.

A number of very good text processing programs have been written in Perl. Ack (a grep replacement) is one.

David Johnstone
  • 24,300
  • 14
  • 68
  • 71
0

Sounds like a job for Perl, much as I don't particularly care for it as a language myself. ActivePerl is a reasonable distribution of Perl for Windows.

Donal Fellows
  • 133,037
  • 18
  • 149
  • 215
0

I'd suggest Perl. It was practically built for parsing log files. As for output, I agree with ghostdog74: HTML is the way to go. Perl has dozens of modules that allow you to build and/or template HTML.

I'd parse out the data using regular expressions, then use Template::Toolkit (on CPAN) to create nice pages using HTML and CSS templates.
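The same parse-then-template idea can be sketched in Python, with the standard library's `string.Template` standing in for Template::Toolkit (it is a far simpler tool, not an equivalent). The colour mapping, the template markup and the `render_html` name are assumptions for illustration, and the input is assumed to be already-parsed `(timestamp, level, message)` tuples.

```python
import html
from string import Template

# Hypothetical mapping from log level to a CSS colour.
LEVEL_COLORS = {"ERROR": "#c00", "WARN": "#b80", "INFO": "#060"}

ROW = Template(
    '<tr style="color: $color"><td>$timestamp</td><td>$level</td><td>$message</td></tr>'
)
PAGE = Template("<html><body><table>\n$rows\n</table></body></html>")


def render_html(entries):
    """Render (timestamp, level, message) tuples as a colour-coded HTML table."""
    rows = []
    for timestamp, level, message in entries:
        rows.append(ROW.substitute(
            color=LEVEL_COLORS.get(level, "#000"),  # unknown levels fall back to black
            timestamp=html.escape(timestamp),
            level=html.escape(level),
            message=html.escape(message),  # escape so log text cannot inject markup
        ))
    return PAGE.substitute(rows="\n".join(rows))
```

Writing the returned string to a `.html` file and opening it in a browser gives the coloured, friendlier view of the log without building a GUI.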

Matthew S
  • 117
  • 4
-2

C/C++ or Java. For C/C++, I have a snippet that might help you:

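    FILE *f = fopen(file, "rb");
    if (f == NULL) {
        return DBDEMON_OPEN_ERROR; // open failed
    }

    // Stop on fscanf's return value, not feof(): feof() only reports end of
    // file after a read hits it, and a malformed record would loop forever.
    // DB_CAPACITY is a hypothetical name for the number of elements in db.
    // The widths (63) assume both MAXSIZE constants are 64; use size - 1.
    for (int i = 0; i < DB_CAPACITY; i++) {
        if (fscanf(f, "%u %63s %63s %c\n", &db[i].id, db[i].name,
                   db[i].uid, &db[i].priviledge) != 4) {
            break; // end of file or malformed record
        }
        db_size++;
    }

    fclose(f);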

This reads a file with the following format:

    int string string char
    1 SOMETHING ANYTHING Z

into a struct defined as follows:

    typedef struct {
        unsigned int    id;
        char        name[DBDEMON_NAME_MAXSIZE];
        char        uid[DBDEMON_UID_MAXSIZE];
        char        priviledge;
    } DATABASE;

Use fscanf with care: the format string is not checked against the argument types, and an unbounded %s can overflow its buffer, so mistakes show up as runtime errors. But I think this is pretty efficient.

Billy ONeal
  • 104,103
  • 58
  • 317
  • 552
luis
  • 345
  • 1
  • 3
  • 4
  • 7
    I am a C/C++ advocate -- and even I'd not call them great languages for text processing. – Billy ONeal Mar 25 '10 at 23:07
  • @Billy - So, C++ doesn't process text well? Would that be balanced out by how it can easily create a Windows GUI, or not? – HenryAdamsJr Mar 25 '10 at 23:18
  • 3
    No, it does not. Strings are not native types in C++, and the language does not have built-in constructs like regular expressions, `starts_with`, case insensitive comparisons, substring, trimming, splitting, etc. Almost every other language includes these as part of the language. – Billy ONeal Mar 25 '10 at 23:29
  • He wanted something fast & simple. I think this is both. It's not a discussion about being good or bad. lol – luis Mar 26 '10 at 09:19