I have a log text file (*.txt) with approximately 2.5 million entries. Using C, I have to read it and write it into another file in a specific format.

The file to be read looks like this:

202.32.92.47 - - [01/Jun/1995:00:00:59 -0600] "GET /~scottp/publish.html" 200 271 - -
ix-or7-27.ix.netcom.com RFC-1413 John Thomas [01/Jun/1995:00:02:51 -0600] "GET /~ladd/ostriches.html" 200 205908 - "Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)" 
ppp-4.pbmo.net - John Thomas [07/Dec/1995:13:20:28 -0600] "GET /dcs/courses/cai/html/introduction_lesson/index.html HTTP/1.0" 500 - "http://www.wikipedia.org/" "Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)" 
ppp-4.pbmo.net - - [07/Dec/1995:13:20:37 -0600] "GET /dcs/courses/cai/html/index.html HTTP/1.0" 500 4528 - - 
lbm2.niddk.nih.gov RFC-1413 - [07/Dec/1995:13:21:03 -0600] "GET /~ladd/vet_libraries.html" 200 11337 "http://www.wikipedia.org/" - 

The format of each line of the original log file is: IP ID NAME [DATE:TIME TIMEZONE] "METHOD DIR" STATUS MB "WEB" "FROM". Here is the previous log example split with || for better visualization:

|| ix-or7-27.ix.netcom.com || RFC-1413 || John Thomas || [01/Jun/1995 || :00:02:51 || -0600] || "GET || /~ladd/ostriches.html" || 200 || 205908 || - || "Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)" ||
|| ppp-4.pbmo.net || - || John Thomas || [07/Dec/1995 || :13:20:28 || -0600] || "GET || /dcs/courses/cai/html/introduction_lesson/index.html HTTP/1.0" || 500 || - || "http://www.wikipedia.org/" || "Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)" ||
|| ppp-4.pbmo.net || - || - || [07/Dec/1995 || :13:20:37 || -0600] || "GET || /dcs/courses/cai/html/index.html HTTP/1.0" || 500 || 4528 || - || - ||
|| lbm2.niddk.nih.gov || RFC-1413 || - || [07/Dec/1995 || :13:21:03 || -0600] || "GET || /~ladd/vet_libraries.html" || 200 || 11337 || "http://www.wikipedia.org/" || - ||

So, for example, for the first line of the split example:

IP = ix-or7-27.ix.netcom.com 
ID = RFC-1413 
NAME = John Thomas 
DATE = 01/Jun/1995
TIME = 00:02:51 
TIMEZONE = -0600 
METHOD = GET 
DIR = /~ladd/ostriches.html
STATUS = 200 
MB = 205908 
WEB = -
FROM = Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)

(Each field's value can be text or -.)

The expected output is:

ix-or7-27.ix.netcom.com | RFC-1413 | John Thomas | 01/Jun/1995 | 00:02:51 | -06 | GET | /~ladd/ostriches.html | 200 | 205908 | - | Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)
ppp-4.pbmo.net | - | John Thomas | 07/Dec/1995 | 13:20:28 | -06 | GET | /dcs/courses/cai/html/introduction_lesson/index.html HTTP/1.0 | 500 | - | http://www.wikipedia.org/ | Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)
ppp-4.pbmo.net | - | - | 07/Dec/1995 | 13:20:37 | -06 | GET | /dcs/courses/cai/html/index.html HTTP/1.0 | 500 | 4528 | - | -
lbm2.niddk.nih.gov | RFC-1413 | - | 07/Dec/1995 | 13:21:03 | -06 | GET | /~ladd/vet_libraries.html | 200 | 11337 | http://www.wikipedia.org/ | -

So the output format is the original line split into fields, with | added between each field. The fields are parsed as follows:

  • First parameter (IP): catch all up to space
  • Second parameter (ID): catch all up to space (can be a string or a -)
  • Third parameter (NAME): catch all up to [ (can be a string with spaces or a -)
  • Fourth parameter (DATE): catch all up to :
  • Fifth parameter (TIME): catch all up to space
  • Sixth parameter (TIMEZONE): catch all up to ] (-dddd must be converted to -dd)
  • Seventh parameter (METHOD): catch all up to space
  • Eighth parameter (DIR): catch all up to space
  • Ninth parameter (STATUS): catch all up to space
  • Tenth parameter (MB): catch all up to space
  • Eleventh parameter (WEB): catch all inside "" (or -)
  • Twelfth parameter (FROM): catch all inside "" (or -)
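
The sixth rule (-dddd becomes -dd) can be handled on the text itself. Below is a minimal sketch, assuming the zone always arrives as a sign followed by exactly four digits; the helper name `shorten_timezone` is hypothetical and not part of any library.

```c
#include <string.h>

/* Hypothetical helper: copy the sign and the two hour digits of a
 * "-0600"-style zone into out (which must hold at least 4 bytes).
 * Returns 0 on success, -1 if tz is not a sign followed by four digits. */
static int shorten_timezone(const char *tz, char out[4])
{
    if ((tz[0] != '-' && tz[0] != '+') || strlen(tz) != 5 ||
        strspn(tz + 1, "0123456789") != 4)
        return -1;
    memcpy(out, tz, 3);   /* sign + two hour digits */
    out[3] = '\0';
    return 0;
}
```

Working on the string keeps the sign even when the hour part is 00, which an integer division would lose.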

Any idea how I could do this?

Thank you.


EDIT 1:

The code I use for reading/writing the file is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    // variables
    char line[255];  // fgets() and strtok() need a char buffer, not int
    char *token;

    // open files
    FILE *fpr = fopen("myLogFile.txt","r");
    FILE *fpw = fopen("myFormattedLogFile.txt","w");

    // read file
    while (fgets(line, 255, fpr) != NULL) {
        token = strtok(line, " ");
        while (token != NULL) {
            // write file
            fprintf(fpw, "%s | ", token);
            token = strtok(NULL, " ");
        }
        fprintf(fpw, "\n");
    }

    // close files
    fclose(fpr);
    fclose(fpw);

    return 0;
}

But it does not work, because it takes John Thomas as two values. I do not know how to produce the correct format (remove [, ], ", change the number format, split date and time, check whether a field is a string or -, ...).


EDIT 2: @CHUX'S SOLUTION

I have some questions:

// 6th pattern: how can I recover it as a string?
// 7th pattern: how can I remove the first "?
// 8th pattern: how can I remove the last "?
// How can I catch everything inside ""? Which pattern should I use?
// What is the variable n?
// What is Invalid_Input? It appears as undeclared.

The code updated after your solution is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define LINE_LENGTH 255

// First parameter (IP): catch all up to space
#define IP_FMT "%s"
char IP[LINE_LENGTH];

// Second parameter (ID): catch all up to space (can be a string or a -)
#define ID_FMT "%s"
char ID[LINE_LENGTH];

// Third parameter (NAME): catch all up to [ (can be a string with spaces or a -)
#define NAME_FMT " %[^[]["
char NAME[LINE_LENGTH];

// Fourth parameter (DATE): catch all up to :
#define DATE_FMT " %11[^:]:"
char DATE[11+1];

// Fifth parameter (TIME): catch all up to space
#define TIME_FMT "%8s"
char TIME[8+1];

// Sixth parameter (TIMEZONE): catch all up to ] (-dddd must be converted to -dd)
#define TIMEZONE_FMT "%5d]"
int TIMEZONE;

// Seventh parameter (METHOD): catch all up to space
#define METHOD_FMT "%s"
char METHOD[LINE_LENGTH];

// Eighth parameter (DIR): catch all up to space
#define DIR_FMT "%s"
char DIR[LINE_LENGTH];

// Ninth parameter (STATUS): catch all up to space
#define STATUS_FMT "%s"
char STATUS[LINE_LENGTH];

// Tenth parameter (MB): catch all up to space
#define MB_FMT "%s"
char MB[LINE_LENGTH];

// Eleventh parameter (WEB): catch all inside "" (or -)

// Twelfth parameter (FROM): catch all inside "" (or -)



int main() {
    // variables
    char *line = malloc(LINE_LENGTH);
    char *token;
    int position = 0;

    // open files
    FILE *fpr = fopen("log.txt","r");
    FILE *fpw = fopen("myFormattedLogFile.txt","w");

    // read file
    while (fgets(line, LINE_LENGTH, fpr) != NULL) {

        int n = 0; 

        sscanf
            (
                line, 
                IP_FMT ID_FMT NAME_FMT DATE_FMT TIME_FMT TIMEZONE_FMT METHOD_FMT DIR_FMT STATUS_FMT MB_FMT " %n", 
                IP, ID, NAME, DATE, TIME, &TIMEZONE, METHOD, DIR, STATUS, MB, &n
            ); 

        NAME[strlen(NAME)-1] = '\0';

        fprintf
            (
                fpw, 
                "%s | %s | %s | %s | %s | %d | %s | %s | %s | %s\n", 
                IP, ID, NAME, DATE, TIME, TIMEZONE, METHOD, DIR, STATUS, MB
            );

    }

    // close files
    fclose(fpr);
    fclose(fpw);

    return 0;
}
JuMoGar
  • Don't use C; use Perl or another scripting language (awk, perhaps). It will be as fast as the C code and more flexible and much (much, much, much) easier to write. – Jonathan Leffler Dec 18 '18 at 17:19
  • @BurnsBA I have tried, but I thought posting my non-working code was not important because it would not help. I do not know if C is the best choice for this; it is the language I know, which is why I use it. I have never heard of `scripting tools` or `scripting languages`. I will read about them – JuMoGar Dec 18 '18 at 17:22
  • @JuMoGar It will help if you post your code, you should edit it into your post. – BurnsBA Dec 18 '18 at 17:23
  • @JonathanLeffler I have never written any code in Perl; I would have to learn it and I have no time now :(. Also, I use Windows so I can't use awk either – JuMoGar Dec 18 '18 at 17:24
  • @BurnsBA Ok, I will add it – JuMoGar Dec 18 '18 at 17:35
  • If you use a POSIXy system (Linux, Mac), you could use a POSIX regular expression to split the input into fields, compiling the expression using [`regcomp()`](http://man7.org/linux/man-pages/man3/regex.3.html), reading each line using `getline()`, and applying the expression using `regexec()`, with each field as a submatch. (Each submatch is given as an offset, length tuple, referring to the input line.) Then, I'd probably just use `fwrite()` to output the fields, and `fputs()` for the separators and newline. Should not be many lines of code at all. – Nominal Animal Dec 18 '18 at 17:40
  • @NominalAnimal Thank you, but I am using Windows so I am not using POSIX system, so I can not do that – JuMoGar Dec 18 '18 at 17:44
  • @JuMoGar: Then add a [tag:Windows] tag, please. – Nominal Animal Dec 18 '18 at 17:45
  • @NominalAnimal Ok, added – JuMoGar Dec 18 '18 at 17:45
  • Note `NAME[strlen(NAME)-1] = '\0';` is not a good way to trim trailing spaces. See https://stackoverflow.com/a/26984026/2410359 and others. – chux - Reinstate Monica Dec 18 '18 at 22:03
  • Ok, I will take a look. Thank you. – JuMoGar Dec 18 '18 at 22:12
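
The trimming comment above can be illustrated with a short sketch. `NAME[strlen(NAME)-1] = '\0';` removes the last character even when it is not a space, and indexes out of bounds when the string is empty. The helper below, with the hypothetical name `trim_trailing`, is one possible alternative and not necessarily the approach used in the linked answer.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: remove trailing whitespace from s in place.
 * Safe for empty strings and for strings with no trailing whitespace. */
static void trim_trailing(char *s)
{
    size_t len = strlen(s);
    while (len > 0 && isspace((unsigned char)s[len - 1]))
        len--;
    s[len] = '\0';
}
```

With this, `trim_trailing(NAME);` would replace the unconditional truncation after the `sscanf()` call.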

1 Answer

sscanf() and "%n" can do the job. Some post-processing may be needed, as with NAME.

With such complex formats, I suggest building the format string from adjacent string literals (string literal concatenation), one per field:

// First parameter (IP): catch all up to space
#define IP_FMT "%s"
char IP[sizeof line];

// Second parameter (ID): catch all up to space (can be a string or a -)
#define ID_FMT "%s"
char ID[sizeof line];

// Third parameter (NAME): catch all up to [ (can be a string with spaces or a -)
#define NAME_FMT " %[^[]["
char NAME[sizeof line];

// Fourth parameter (DATE): catch all up to :
#define DATE_FMT " %11[^:]:"
char DATE[11+1];

// Fifth parameter (TIME): catch all up to space
#define TIME_FMT "%8s"
char TIME[8+1];

// Sixth parameter (TIMEZONE): catch all up to ] (-dddd must be converted to -dd)
#define TIMEZONE_FMT "%5d]"
int TIMEZONE;

// Other fields left for OP

int n = 0;
sscanf(s, IP_FMT ID_FMT NAME_FMT DATE_FMT TIME_FMT TIMEZONE_FMT " %n", 
    IP, ID, NAME, DATE, TIME, &TIMEZONE, &n);

if (n == 0) return Invalid_Input;
trim(NAME);
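
For the fields left open above, here is one possible sketch covering all twelve fields, offered as an illustration and not as the definitive solution. The names `log_entry`, `quoted_or_bare`, `parse_line`, `write_entry` and the 1024-byte field limit are assumptions made for this example. `n` is the offset stored by `%n`: it stays 0 unless every directive before it matched, which is what the `n == 0` check detects. The pattern `\"%[^\"]\"` reads everything between two double quotes.

```c
#include <stdio.h>
#include <string.h>

enum { FIELD_MAX = 1024 };       /* assumed per-field limit; widths below are FIELD_MAX - 1 */

struct log_entry {               /* hypothetical container for one parsed line */
    char ip[FIELD_MAX], id[FIELD_MAX], name[FIELD_MAX];
    char date[12], time[9];
    int  tz;                     /* e.g. -600 for "-0600" */
    char method[16], dir[FIELD_MAX], status[16], mb[16];
    char web[FIELD_MAX], from[FIELD_MAX];
};

/* Read one field that is either "quoted text" or a bare token such as -.
 * Advances *pp past the field; returns 0 on success, -1 otherwise. */
static int quoted_or_bare(const char **pp, char *out)
{
    int n = 0;
    if (sscanf(*pp, " \"%1023[^\"]\"%n", out, &n) == 1 && n > 0) {
        *pp += n;                /* quoted form matched, closing quote included */
        return 0;
    }
    n = 0;
    if (sscanf(*pp, "%1023s%n", out, &n) == 1) {
        *pp += n;                /* bare token, typically - */
        return 0;
    }
    return -1;
}

/* Returns 0 on success, -1 if the line does not match the expected layout. */
static int parse_line(const char *line, struct log_entry *e)
{
    int n = 0;                   /* stays 0 unless all ten fields before %n matched */
    sscanf(line,
           "%1023s %1023s %1023[^[][" "%11[^:]:" "%8s %5d]"
           " \"%15s %1023[^\"]\"" " %15s %15s%n",
           e->ip, e->id, e->name, e->date, e->time, &e->tz,
           e->method, e->dir, e->status, e->mb, &n);
    if (n == 0)
        return -1;

    /* NAME was read up to '[', so it still carries the trailing space. */
    size_t len = strlen(e->name);
    while (len > 0 && e->name[len - 1] == ' ')
        e->name[--len] = '\0';

    const char *rest = line + n; /* WEB and FROM follow the first ten fields */
    if (quoted_or_bare(&rest, e->web) != 0 || quoted_or_bare(&rest, e->from) != 0)
        return -1;
    return 0;
}

/* Write the fields separated by " | "; the zone is printed as sign plus hours. */
static void write_entry(FILE *out, const struct log_entry *e)
{
    fprintf(out, "%s | %s | %s | %s | %s | %+03d | %s | %s | %s | %s | %s | %s\n",
            e->ip, e->id, e->name, e->date, e->time, e->tz / 100,
            e->method, e->dir, e->status, e->mb, e->web, e->from);
}
```

In the read loop, each line returned by `fgets()` would be passed to `parse_line()` and, on success, to `write_entry()`. A line buffer larger than 255 bytes is advisable, since the longer sample lines come close to that length.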
chux - Reinstate Monica