
I have a text file with 4 tab-separated columns. It is the output of a program that does not recognize some characters, such as the apostrophe ' or the midpoint ·, and those errors are tagged as Fz. The output structure will be word, tag, lemma (as in the corrected example below), with the numbers removed. A little explanation: each line is a tag that represents the morphosyntax of a word in Catalan. In this language there can be contractions (the apostrophes) at the beginning or at the end of a word.

I need to find those errors, delete each error line, and modify the previous or the next line, depending on whether the apostrophe is at the beginning or at the end of a word.

Example, original:

s   segon   NCMN000 1
’   ’   Fz  1
l   litre   NCMN000 1
’   ’   Fz  1
esplèndida  esplèndid   AQ0FS0  1
l   litre   NCMN000 1
’   ’   Fz  1
armaren armar   VMIS3P0 0.388664
’   ’   Fz  1
l   litre   NCMN000 1
obeïren obeir   VMIS3P0 0.388664
t   t   AQ0CS0  0.0283644
‘   ‘   Fz  1
aparellen   aparellar   VMIP3P0 0.890625
‘   ‘   Fz  1
t   t   AQ0CS0  0.0283644

Correct (typed by hand):

s'  P0300000    es
l'  DA0CS0  el
esplèndida  AQ0FS0  esplèndid
l'  PP3MSA00    ell
armaren VMIS3P0 armar
'l  PP3MSA00    ell
obeïren VMIS3P0 obeir
t'  PP2CS000    tu
aparellen   VMIP3P0 aparellar
‘t  PP2CS000    tu

As you can see, the same error requires different changes, because it may belong to the previous or to the next line. Even for the same line there can be several solutions (offer a switch and let the user decide), depending on the gender of the contracted word:

  • The emotion (feminine): LA emoció -> l'emoció.

  • The advocate (lawyer) (masculine): EL advocat -> l'advocat.

  • A few questions: 1> Is there any way to escape your data before feeding it to the program that doesn't understand it (to avoid the errors altogether)? 2> Can you post your C code? 3> How did you arrive at your corrected output? (It doesn't appear to contain the same data as the original.) – n0741337 Dec 04 '13 at 01:01
  • 1> No, that isn't possible. 2> http://pastecode.org/index.php/view/50344031 3> I edited it manually, it's OK. I removed the line with Fz, edited the line above or below, and then reorganized the columns – user3063678 Dec 04 '13 at 11:12
  • It is very hard to understand what you are trying to do. Are the numbers at the end of the lines being removed? Why are the codes like "VMIP3P0" sometimes the second field in a line, sometimes the third? What is the relevance of the apostrophes? Why does the error sometimes relate to the next line, and sometimes the previous line? – Gavin Smith Dec 04 '13 at 21:25
  • You are right @GavinSmith, it was my fault (the second output was wrong on two lines)... The structure is word, tag, lemma, and the numbers are removed. A little explanation: each line is a tag that represents the morphosyntax of a word in Catalan. In this language there can be contractions (the apostrophes) at the beginning or at the end of a word. – user3063678 Dec 05 '13 at 10:19

5 Answers


Use a regular expression to identify the error lines, for example this one:

^(’|‘)[[:space:]]+(’|‘)[[:space:]]+Fz[[:space:]]+[[:digit:]]

Go through your file line by line, assigning that line to a string buffer.

Compile the above regular expression with regcomp like this:

regex_t regex; // needs #include <regex.h>
// REG_EXTENDED enables + and |; POSIX regexes have no \s and no [[:d:]] class
int reti = regcomp(&regex, "^(’|‘)[[:space:]]+(’|‘)[[:space:]]+Fz[[:space:]]+[[:digit:]]", REG_EXTENDED);
reti = regexec(&regex, bufferAsString, 0, NULL, 0); //where bufferAsString
//is your file's buffer as a string ending in \0

and then check reti's value: regexec returns 0 on a match, so if reti is 0 you have found your line and you can do whatever you want with it or with the previous one.

The regular expression means: line start followed immediately by a ’ OR a ‘, then one or more spaces/tabs, followed by a ’ OR a ‘, followed by spaces/tabs, followed by an F and then a z, more spaces/tabs, and a single numerical digit. The quotes are written as alternations rather than as a bracket expression, because a multi-byte UTF-8 character inside [...] is only treated as one character in a UTF-8 locale.

The more precise you are the better, so if you know it's only going to be 4 spaces, replace [[:space:]]+ with [[:space:]]{4}, or if you know that after Fz the number is always 1, replace [[:digit:]] with 1.

Here is an example of regular expressions in C: "Regular expressions in C: examples?"
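As a rough sketch of the steps above, the check could be wrapped in a small function. The name is_fz_line is hypothetical, and the pattern assumes the source file and the input are both UTF-8 encoded:

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `line` looks like an Fz error line, 0 if it does not,
 * and -1 if the pattern fails to compile.
 * is_fz_line is a hypothetical helper name, not part of any library. */
int is_fz_line(const char *line)
{
    /* Alternation instead of a bracket expression, so the multi-byte
     * UTF-8 quotes are matched byte for byte in any locale. */
    static const char *pattern =
        "^(’|‘)[[:space:]]+(’|‘)[[:space:]]+Fz[[:space:]]+[[:digit:]]";
    regex_t regex;
    int matched;

    assert(line != NULL);
    if (regcomp(&regex, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    /* regexec returns 0 on a match and REG_NOMATCH otherwise. */
    matched = (regexec(&regex, line, 0, NULL, 0) == 0);
    regfree(&regex);
    return matched;
}
```

Compiling the pattern on every call keeps the sketch short; for a large file it would probably be better to call regcomp once before the read loop and regfree once after it.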

RaidenF
  • Thanks for your code, but I don't only need to erase the line with Fz. I need to remove it and edit the line above or below. Can I read the line above/below with fgets() without moving the file pointer? – user3063678 Dec 04 '13 at 11:17
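On the question in the comment above: fgets() always advances the stream, but the position can be saved and restored around the call. A minimal sketch, assuming a seekable file (not a pipe); peek_line is a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>

/* Reads the next line into buf without consuming it: the file position
 * is saved with ftell and restored with fseek. Returns buf, or NULL at
 * end of file or on error. peek_line is a hypothetical helper name. */
char *peek_line(char *buf, int size, FILE *fp)
{
    long pos;
    char *result;

    assert(buf != NULL && fp != NULL && size > 0);
    pos = ftell(fp);
    if (pos < 0)
        return NULL;
    result = fgets(buf, size, fp);
    /* Seeking back also clears the end-of-file indicator. */
    if (fseek(fp, pos, SEEK_SET) != 0)
        return NULL;
    return result;
}
```

This only helps with the line below. The line above has already been consumed, so it is usually simpler to keep it in a second buffer than to seek backwards.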

It would probably be better to use a program that gives you better output.

Failing that, something can probably be done with awk. Here is a partial solution, which assumes the error lines all relate to the previous line:

$3 == "Fz" { one = one "'"; }

$3 != "Fz" {
  output();
  one = $1; two = $2; three = $3;
}

END { output(); }

function output()
{
  if (one != "") print one " " three " " two
}

This gives the output

s' NCMN000 segon
l' NCMN000 litre
esplèndida AQ0FS0 esplèndid
l' NCMN000 litre
armaren' VMIS3P0 armar
l NCMN000 litre
obeïren VMIS3P0 obeir
t' AQ0CS0 t
aparellen' VMIP3P0 aparellar
t AQ0CS0 t

for the input you gave. As you can see it always appends an apostrophe to the first word on the previous line.
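Since the question is about C, here is a rough C equivalent of the awk script above. It is only a sketch with the same limitation (every Fz line is merged into the previous word); merge_fz_into_previous is a hypothetical name, and fields are assumed to be shorter than 128 bytes:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copies records from `in` to `out` as "word tag lemma". An Fz record is
 * dropped and an apostrophe is appended to the held-back previous word.
 * merge_fz_into_previous is a hypothetical name. */
void merge_fz_into_previous(FILE *in, FILE *out)
{
    char line[512];
    char word[128] = "", lemma[128] = "", tag[128] = "";
    char w[128], l[128], t[128];

    assert(in != NULL && out != NULL);
    while (fgets(line, sizeof line, in) != NULL) {
        /* Input columns: word, lemma, tag (the 4th column is ignored). */
        if (sscanf(line, "%127s %127s %127s", w, l, t) != 3)
            continue;
        if (strcmp(t, "Fz") == 0) {
            /* Append only if the apostrophe still fits in the buffer. */
            if (strlen(word) + 1 < sizeof word)
                strcat(word, "'");
            continue;
        }
        /* A normal record: flush the held-back one, then hold this one. */
        if (word[0] != '\0')
            fprintf(out, "%s %s %s\n", word, tag, lemma);
        strcpy(word, w);
        strcpy(lemma, l);
        strcpy(tag, t);
    }
    if (word[0] != '\0')
        fprintf(out, "%s %s %s\n", word, tag, lemma);
}
```

Holding back one record before printing it is what allows a later Fz line to still modify it, exactly as the variables one, two and three do in the awk version.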

Gavin Smith
  • You make it look so easy! I'll continue with the C code and return with news. Thank you very much! – user3063678 Dec 05 '13 at 12:32
  • This is my new code: http://pastecode.org/index.php/view/891040 thanks for your help! Tomorrow I will implement your code in C to clean up my text file (remove the numbers and organize columns) – user3063678 Dec 05 '13 at 21:21

If every error line contains Fz, and no other line does, you can do:

awk '!/Fz/' file
s   segon   NCMN000 1
l   litre   NCMN000 1
esplèndida  esplèndid   AQ0FS0  1
l   litre   NCMN000 1
armaren armar   VMIS3P0 0.388664
l   litre   NCMN000 1
obeïren obeir   VMIS3P0 0.388664
t   t   AQ0CS0  0.0283644
aparellen   aparellar   VMIP3P0 0.890625
t   t   AQ0CS0  0.0283644
Jotne
  • Thanks for your help, but I don't only need to remove the lines with Fz... I also need to edit the line above or below, adding that apostrophe – user3063678 Dec 04 '13 at 11:13

I still don't understand how you arrived at your output, but here are some changes for your c-code that will give you "rolling" line data at a field level. I've declared a struct to hold the fields, made three copies to hold lines, pointers to those structs to "roll" them as each new line is read and a function for loading the struct from line data. Currently the code prints out the input line inside the loadLine() function, but not exactly as input ( whitespaces are not preserved with the sscanf() call and the output uses a single space as a separator ). You'll need to implement your line test and printing logic in the section indicated( search for TODO ):

#include <stdio.h>
#include <string.h>

typedef struct {
    char one[125];
    char two[125];
    char three[125];
    char four[125];
    } INPUT_LINE;

int loadLine( char *fline, INPUT_LINE* line ) {
    // field widths keep sscanf() from overflowing the 125-byte buffers
    int retval = sscanf( fline, "%124s %124s %124s %124s", line->one, line->two,
        line->three, line->four );

    // TODO - comment this out later
    printf("%s %s %s %s\n", line->one, line->two, line->three, line->four );

    return( retval == 4 );
}

int Buscar_error(char *fname, char *str) {
    FILE *fp;
    int line_num = 1;
    int find_result = 0;
    char temp[512];
    INPUT_LINE first = { 0 };
    INPUT_LINE second = { 0 };
    INPUT_LINE third = { 0 };
    INPUT_LINE *first_ptr, *second_ptr, *third_ptr = NULL;

    // gcc users
    if((fp = fopen(fname, "r")) == NULL) {
        return(-1);
    }

    first_ptr = &first;
    second_ptr = &second;
    third_ptr = &third;

    while(fgets(temp, 512, fp) != NULL) {
        if( line_num == 1 )
            { loadLine( temp, first_ptr ); line_num++; continue; }
        else if( line_num == 2 )
            { loadLine( temp, second_ptr ); line_num++; continue; }
        else if( line_num == 3 ) { loadLine( temp, third_ptr ); }
        else if( line_num > 3 ) {
            // re-order the pointers so they're rolling; the old first
            // buffer is saved and recycled for the new third line
            INPUT_LINE *recycled = first_ptr;
            first_ptr = second_ptr;
            second_ptr = third_ptr;
            third_ptr = recycled;
            memset( third_ptr, 0, sizeof( INPUT_LINE ) );
            loadLine( temp, third_ptr );
            }

        // TODO: Your tests go here with access to three lines at a time
        // via the pointers first_ptr, second_ptr, third_ptr.
        // print out only the data you want from those tests/lines

        line_num++;
    }

    // Close the file if still open.
    if(fp) { fclose(fp); }

    return(0);
}

Good luck!
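One possible way to fill in the TODO is sketched below. It is an assumption, not a confirmed rule of the tagger's output: the apostrophe is taken to belong to the previous line when the previous word is a single letter from "smdtl", and to the next line when the next word is a single letter from "ltn". Both letter sets, and the names Attach and classify_fz, are hypothetical choices based on the sample data in the question.

```c
#include <assert.h>
#include <string.h>

typedef enum { ATTACH_NONE, ATTACH_PREVIOUS, ATTACH_NEXT } Attach;

/* True when `word` is exactly one character long and that character
 * appears in `letters`. */
static int is_single_letter(const char *word, const char *letters)
{
    return word[0] != '\0' && word[1] == '\0'
        && strchr(letters, word[0]) != NULL;
}

/* Decides which neighbour the apostrophe of an Fz line belongs to.
 * The letter sets are assumptions taken from the sample data. */
Attach classify_fz(const char *prev_word, const char *cur_tag,
                   const char *next_word)
{
    assert(prev_word != NULL && cur_tag != NULL && next_word != NULL);
    if (strcmp(cur_tag, "Fz") != 0)
        return ATTACH_NONE;        /* not an error line */
    if (is_single_letter(prev_word, "smdtl"))
        return ATTACH_PREVIOUS;    /* e.g. l + apostrophe -> l' */
    if (is_single_letter(next_word, "ltn"))
        return ATTACH_NEXT;        /* e.g. apostrophe + l -> 'l */
    return ATTACH_NONE;
}
```

With the rolling pointers above, the call would look like classify_fz(first_ptr->one, second_ptr->three, third_ptr->one), since the word is the first field and the tag is the third. Choosing between the possible tags for l' still needs a human decision, as the question notes.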

n0741337
  • The output was typed manually, just so you would know what I meant. I'm very surprised and grateful for your help, thanks for this big push! – user3063678 Dec 04 '13 at 20:33
  • If your new code does what you want, put that in an answer to the question and mark it as the answer. Don't confuse others hunting SO with the solution in the question. If someone else's answer was helpful - you could upvote them or mark their answer as the solution. – n0741337 Dec 05 '13 at 22:35

This is my final C code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char *argv[]){

    if(argc < 3 || argc > 3){ //If the file names are not passed
        Uso(argv[0]); //Show the help on how to use the program
        exit(1);
    }

    //system("clear"); Clear the console

    FILE *fps, *fpd;
    int linea = 2, fin = 0, opswitch=0;
    char temp1[80]="", temp2[80]="", temp3[80]="", apost[5]="’", punt[3]="·", opcion[5]="", paraula[20]="", etiqueta[10]="", generic[20]="";

    if((fps = fopen(argv[1], "r")) == NULL){ //If the file does not open correctly
        return(-1);
    }
    if((fpd = fopen(argv[2], "w+")) == NULL){
        return(-1);
    }


    fgets(temp1, 80, fps);
    fgets(temp2, 80, fps);
    fgets(temp3, 80, fps);

    do{
        //The line has an apostrophe and the Fz error mark
        if((strstr(temp2, apost)) != NULL && (strstr(temp2, "Fz")) != NULL){
            //printf("%d******\n1[%c - %c] 3[%c - %c] \n",linea, temp1[0], temp1[1], temp3[0], temp3[1]);
            //printf("%s\n%s\n%s\n", temp1, temp2, temp3);

            //Apostrophe attaches to the previous line (s’, m’, d’, t’, l’)
            if(temp1[0]=='s' && temp1[1]=='\t'){ fprintf(fpd, "%s", "s’\tP0300000\tes\n");}
            if(temp1[0]=='m' && temp1[1]=='\t'){ fprintf(fpd, "%s", "m’\tPP1CS000\tjo\n");}
            if(temp1[0]=='d' && temp1[1]=='\t'){ fprintf(fpd, "%s", "d’\tSPS00\tde\n");}
            if(temp1[0]=='t' && temp1[1]=='\t'){ fprintf(fpd, "%s", "t’\tPP2CS000\ttu\n");}
            if(temp1[0]=='l' && temp1[1]=='\t'){
                printf("- S'ha trobat una ela apostrofada a la linea %d\n", linea);         
                printf("La linea següent diu:\t%s\n", temp3);
                //Choose one of the options
                printf("Tria una opció:\n\t1) l’\tDA0CS0\tel\n\t2) l’\tDA0FS0\tel\n\t3) l’\tPP3MSA00\tell\nOpció:");
                do{             
                scanf("%d", &opswitch);
                if(opswitch <1 || opswitch >3) printf("Trieu una de les opcións posibles. Opció:");
                }while(opswitch !=1 && opswitch!=2 && opswitch !=3);
                switch(opswitch){
                case 1: fprintf(fpd, "%s", "l’\tDA0CS0\tel\n"); break;
                case 2: fprintf(fpd, "%s", "l’\tDA0FS0\tel\n"); break;
                case 3: fprintf(fpd, "%s", "l’\tPP3MSA00\tell\n"); break;
                }
            }

            //Apostrophe attaches to the next line (’l, ’t, ’n)
            if(temp3[0]=='l' && temp3[1]=='\t'){ fprintf(fpd, "%s", "’l\tPP3MSA00\tell\n");}
            if(temp3[0]=='t' && temp3[1]=='\t'){ fprintf(fpd, "%s", "’t\tPP2CS000\ttu\n");}
            if(temp3[0]=='n' && temp3[1]=='\t'){ fprintf(fpd, "%s", "’n\tPP3CN000\ten\n");}
        }
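        //The line has a midpoint and the Fz error mark
        //(else if, so an apostrophe line never falls through to the final else and gets written out)
        else if((strstr(temp2, punt)) != NULL && (strstr(temp2, "Fz")) != NULL){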
            printf("- S'ha trobat una ela geminada a la linea %d\n", linea);            
            printf("La linea previa diu:\t%s", temp1);
            printf("La linea següent diu:\t%s\n", temp3);
            do{
            //The user types the line manually; the widths match the buffer sizes
            printf("\tFica la paraula:");
            scanf("%19s", paraula);
            printf("\tFica l'etiqueta:");
            scanf("%9s", etiqueta);
            printf("\tFica el genèric:");
            scanf("%19s", generic);
            printf("\n%s\t%s\t%s\nVols agregar-la així?\t",paraula,etiqueta,generic);
            scanf("%4s", opcion);

            }while(strcmp(opcion,"si"));
            fprintf(fpd, "%s\t%s\t%s\n", paraula, etiqueta, generic);
        }
        else{
        //TODO: Write the line as is, changing the column order and removing the numbers
        fprintf(fpd, "%s", temp2);      
        }

        strcpy(temp1,temp2);
        strcpy(temp2,temp3);
        if(fgets(temp3, 80, fps)==NULL){ temp3[0]='\0'; ++fin; } //At EOF, clear temp3 so the last line is not processed again
        ++linea;
    }while(fin < 2);

    if(fps) fclose(fps); //Close both files if they are open
    if(fpd) fclose(fpd);


}
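The remaining TODO (write the line with the columns reordered and the number dropped) could be handled by a helper along these lines. This is only a sketch: reorder_columns is a hypothetical name, and it assumes whitespace-separated fields of fewer than 80 bytes each.

```c
#include <assert.h>
#include <stdio.h>

/* Rewrites "word lemma tag number" as "word\ttag\tlemma\n" in `out`.
 * Returns 1 on success, 0 if the line has fewer than three fields or
 * the result does not fit. reorder_columns is a hypothetical name. */
int reorder_columns(const char *line, char *out, size_t out_size)
{
    char word[80], lemma[80], tag[80];
    int n;

    assert(line != NULL && out != NULL);
    /* The 4th column (the number) is simply never read. */
    if (sscanf(line, "%79s %79s %79s", word, lemma, tag) != 3)
        return 0;
    n = snprintf(out, out_size, "%s\t%s\t%s\n", word, tag, lemma);
    return n >= 0 && (size_t)n < out_size;
}
```

In the final else branch, the result could then be written with fprintf(fpd, "%s", out) instead of writing temp2 unchanged.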