
I have a large .xml file and need to pull specific bits out of it. The things I need to pull out are encapsulated by a substring on either side. I need to write the output to a file.

I search for the starting substring and, from there, for the ending substring, then copy the text between them and print it with fprintf. I then set the start pointer to the position of the last end pointer, and the search continues until it runs into a SIGSEGV.

I don't know how to stop the loop after the last occurrence of the substrings I'm searching for, before it runs into the SIGSEGV.

An interesting detail: if I write to stdout, the program prints everything I want to pull out and then crashes. If I write to a file instead, the output is not the same: the program crashes before the file is complete, and the last 37 lines of output are lost.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {

    FILE *fp;
    fp = fopen("C:/Users/entin/Desktop/IHP/Auswerte_Marko/TEMP/20190605204730250_S210D_PQ41701_TM2_TV2_MARK21Single_21Single.ega_rslt", "r");

    FILE *fw;
    fw = fopen("C:/Users/entin/Desktop/IHP/Auswerte_Marko/TEMP/t1.xml", "w");

    int f_length;
    fseek(fp, 0, SEEK_END);
    f_length = ftell(fp);
    char file[f_length + 1];
    rewind(fp);
    fread(file, f_length, 1, fp);
    file[f_length] = 0; 



    const char *SPattern = "<MeasData "; // start of substring
    const char *EPattern = "</MeasData>"; // end of substring
    char *start, *end;
    char *target = NULL;

    if (start = strstr(file, SPattern)) { // search for start substring
        start += strlen(SPattern);
        if (end = strstr(start, EPattern)) { // search for end substring
            target = (char *) malloc(end - start + 1);
            memcpy(target, start, end - start); // copying content between start and end pointers
            target[end - start] = '\0';

            start = end; // setting new start to old end
        }
    }

    if (target) fprintf(stdout, "%s%s%s\n", SPattern, target, EPattern); // assembling everything back together

    free(target);


    //while (end <= EOF) { // repeating till end of file is reached
    while (end != NULL && *end != 0){ //EDIT from comments
        char *target = NULL;
        if (start = strstr(start, SPattern)) { // starting search from last end pointer
            start += strlen(SPattern);
            if (end = strstr(start, EPattern)) {
                target = (char *) malloc(end - start + 1);
                memcpy(target, start, end - start);
                target[end - start] = '\0';

                start = end;
            }
        }

        if (target) fprintf(stdout, "%s%s%s\n", SPattern, target, EPattern);

        free(target);
    }

    fclose(fp);
    fclose(fw);
    getchar();
    return 0;
}

Here are the files:

Input File

Output to stdout that I want in a file

Output that I get when I write to a file

(only the last lines of the output matter)

  • `while (end <= EOF)` This is not correct. `EOF` is a constant value that indicates end of file. It is only suitable for equality checks; "less or equal" does not make any sense. Also, `end` is a pointer while `EOF` is an integer value. Did you intend to use `while(end != NULL && *end != 0)`? – Gerhardh Aug 14 '19 at 09:56
  • `while (end <= EOF)`: this doesn't make any sense: 1: you compare the _pointer_ `end` to `EOF`, which is not a pointer; 2: you don't do any file operation within the while loop, so testing for `EOF` doesn't make sense anyway. – Jabberwocky Aug 14 '19 at 10:00
  • This is completely true; now I feel a bit stupid for not catching that myself. Sadly, implementing it didn't solve the problem: the SIGSEGV still comes up, with the same result. – DerEntinator Aug 14 '19 at 10:04
  • Probably not the source of your problem, but: don't cast the result from `malloc()`, and do check it's not null. – Toby Speight Aug 14 '19 at 10:09
  • @DerEntinator `*end != NULL` is wrong, `*end` is not a pointer, `end` is a pointer. Read again the first comment carefully. – Jabberwocky Aug 14 '19 at 10:09
  • If `start` doesn't get set to a non-`NULL` value, then `end` won't get set. `end` has not been initialised before it is tested, so you have *undefined behaviour*. – Weather Vane Aug 14 '19 at 10:10
  • `char file[f_length + 1];` is dangerous if `f_length` is more than will fit in the stack frame. Oh, and don't ignore the return value of `fread()`. – Toby Speight Aug 14 '19 at 10:11
  • @WeatherVane The weird thing is that it can output the correct thing to stdout, but not to a file directly – DerEntinator Aug 14 '19 at 10:23
  • So? It's undefined behaviour. – Weather Vane Aug 14 '19 at 10:24
  • That's probably because `stdout` is line buffered and each `printf` is shown immediately on the terminal. OTOH the file can have a larger buffer, and when your program crashes and burns, that buffer is never flushed. – Gerhardh Aug 14 '19 at 10:25
  • You could use `memmem()` instead of `strstr()` – wildplasser Aug 14 '19 at 10:26
  • Note that by using `fread` you can't guarantee that any sequence of characters is zero-terminated and thus is safe to apply `strstr`. If searching for text you should be using text-based input functions. Also you might miss a construction that starts in one data block and finishes in the next one. – Weather Vane Aug 14 '19 at 10:29
  • @WeatherVane what do you mean by text-based input functions? – DerEntinator Aug 14 '19 at 10:33
  • Properly parsing XML is **hard**. See [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg) and [Why use an XML parser?](https://stackoverflow.com/questions/3597239/why-use-an-xml-parser). – Andrew Henle Aug 14 '19 at 11:10
  • Is there any other way to achieve what I want, i.e. pulling the text between two strings out of the file and saving it to a new one? – DerEntinator Aug 14 '19 at 11:12
  • First rule: COMPILE WITH WARNINGS ENABLED. – Jim Balter Aug 14 '19 at 12:57
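
The buffering explanation in the comments above can be made visible with a small sketch. It assumes a hypothetical helper, `write_record_flushed`, that is not part of the question's code: it writes one record and calls `fflush` right away, so records written before a crash have already left the stdio buffer. This only exposes the symptom; it does not fix the crash itself.

```c
#include <assert.h>
#include <stdio.h>

/* Writes one record followed by a newline and flushes the stream at once.
 * Returns 0 on success, -1 on a write or flush error.
 * write_record_flushed is a hypothetical helper name. */
int write_record_flushed(FILE *out, const char *record)
{
    assert(out != NULL && record != NULL);

    if (fprintf(out, "%s\n", record) < 0)
        return -1;

    /* Push the stdio buffer to the OS so a later crash cannot lose it. */
    return fflush(out) == 0 ? 0 : -1;
}
```

Calling `setvbuf(fw, NULL, _IONBF, 0)` right after `fopen` is an alternative that turns off buffering for the whole stream, at the cost of slower writes.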

1 Answer


You should check start in the loop condition, not end.

while (end != NULL && *end != 0){ //EDIT from comments
    char *target = NULL;
    if (start = strstr(start, SPattern)) { // starting search from last end pointer
        start += strlen(SPattern);
        if (end = strstr(start, EPattern)) {
            target = (char *) malloc(end - start + 1);
            memcpy(target, start, end - start);
            target[end - start] = '\0';

            start = end;
        }
    }

    if (target) fprintf(stdout, "%s%s%s\n", SPattern, target, EPattern);

    free(target);
}

Once you have found the last element and search for the next one, strstr returns NULL, so start becomes NULL and the if block is skipped. end is not changed in that case, so the loop condition still holds and strstr is called again, this time with start == NULL.

As far as I remember, strstr is not required to check that its arguments are valid pointers, so passing NULL is undefined behavior and typically ends in a crash.

In your loop end will only ever become NULL if you find the start pattern but no end pattern. For a valid XML file this is rather unlikely to happen.
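
A minimal sketch of a loop that tests start instead of end might look like the following. The function name `write_blocks` and its signature are assumptions made for illustration. It also writes the matched range directly with `fwrite` rather than copying it into a `malloc`'d buffer first, which is a simplification and not the question's approach.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes every block that begins with start_pat and ends with end_pat
 * to out, one block per line, and returns the number of blocks written.
 * write_blocks is a hypothetical name; both patterns must be non-empty. */
size_t write_blocks(const char *text, const char *start_pat,
                    const char *end_pat, FILE *out)
{
    assert(text != NULL && out != NULL);
    assert(start_pat != NULL && *start_pat != '\0');
    assert(end_pat != NULL && *end_pat != '\0');

    size_t count = 0;
    const char *start = text;

    /* The loop ends as soon as no further start pattern is found,
     * so strstr() is never called with a NULL haystack. */
    while ((start = strstr(start, start_pat)) != NULL) {
        const char *end = strstr(start + strlen(start_pat), end_pat);
        if (end == NULL)
            break;                  /* start without matching end: stop */
        end += strlen(end_pat);     /* include the closing pattern */

        /* Write the bytes in [start, end) without a temporary copy. */
        fwrite(start, 1, (size_t)(end - start), out);
        fputc('\n', out);

        ++count;
        start = end;                /* resume the search after this block */
    }
    return count;
}
```

With the question's variables, a call such as `write_blocks(file, SPattern, EPattern, fw)` would send the blocks to the output file instead of stdout.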

S.S. Anne
Gerhardh
  • Note that `end` is *uninitialised* before first use. The OP's code still has plenty more wrong with it and there isn't one simple fix. – Weather Vane Aug 14 '19 at 10:56
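
Some of the other points raised in the comments under the question (the variable-length array on the stack, the unchecked `fread`, and the unchecked `malloc`) could be addressed along these lines. This is only a sketch: `read_file` is a hypothetical helper, and it keeps the question's `fseek`/`ftell` way of sizing the file. It opens the file in binary mode because, on Windows, text mode translates line endings, so `fread` may return fewer bytes than `ftell` reported and leave the rest of the buffer uninitialised.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads the whole file at path into a heap buffer and NUL-terminates it.
 * Returns NULL on failure; the caller must free() the result.
 * If len_out is not NULL, it receives the number of bytes read.
 * read_file is a hypothetical helper name. */
char *read_file(const char *path, size_t *len_out)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "rb");   /* binary mode: no newline translation */
    if (fp == NULL)
        return NULL;

    char *buf = NULL;
    long size = -1;
    if (fseek(fp, 0, SEEK_END) == 0)
        size = ftell(fp);

    if (size >= 0 && fseek(fp, 0, SEEK_SET) == 0)
        buf = malloc((size_t)size + 1);     /* heap, not a stack VLA */

    if (buf != NULL) {
        size_t got = fread(buf, 1, (size_t)size, fp);
        if (ferror(fp)) {
            free(buf);
            buf = NULL;
        } else {
            buf[got] = '\0';    /* terminate after the bytes actually read */
            if (len_out != NULL)
                *len_out = got;
        }
    }

    fclose(fp);
    return buf;
}
```

Note that `strstr` still stops at the first NUL byte, so a file with embedded zero bytes would need a length-based search such as the `memmem()` suggested in the comments (a non-standard extension available on some platforms).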