I need to write a program that reads a Wikipedia source file and extracts all the links to other pages. All the links look like this example:
<a href="/wiki/PageName" title="PageName">Chicken</a>
I basically need to compare the PageName after /wiki/ with the title, and if they are the same (as above), display just the PageName on the terminal.
However, the following should not be matched since they are not in the same format as above:
<a href="http://chicken.com">Chicken</a>
(this is a link to a normal website off Wikipedia)
<a href="/wiki/Chicken">Chicken</a>
(this one is missing the title= section)
The output I am trying to achieve is just each matching PageName printed on its own line (so for the example above, just Chicken).
I have worked on this for quite a while and have been able to do the following:
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    FILE *file;
    char line[512];
    char *search;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <wikipedia source file>\n", argv[0]);
        return 1;
    }

    file = fopen(argv[1], "r");
    if (file == NULL) {
        perror(argv[1]);
        return 1;
    }

    /* fgets returns NULL at end of file; checking that instead of !feof() avoids reading the last line twice */
    while (fgets(line, sizeof line, file) != NULL) {
        search = strstr(line, "<a href=\"/wiki/");
        if (search != NULL) {
            puts(search); /* prints from the match to the end of the line */
        }
    }

    fclose(file);
    return 0;
}
This only gets me as far as the /wiki/ part of each link; I am blank on how to proceed from there. I have searched a lot but have not been able to get a lead. Help would be highly appreciated.
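The only direction I can think of is something along the lines of the sketch below, but I am not sure it is the right way to go about it. print_if_match is just a name I made up, the sketch is untested, it only uses the <stdio.h> and <string.h> headers already included above, and it assumes the href and title attributes sit on one line, in that order, with at most one link per line.

/* Sketch: if a line contains <a href="/wiki/PageName" title="Title">
   and PageName equals Title, print PageName. */
static void print_if_match(const char *line)
{
    const char *href = strstr(line, "<a href=\"/wiki/");
    if (href == NULL)
        return;

    const char *name_start = href + strlen("<a href=\"/wiki/");
    const char *name_end = strchr(name_start, '"');    /* end of the href value */
    if (name_end == NULL)
        return;

    const char *title = strstr(name_end, "title=\"");
    if (title == NULL)
        return;                                        /* no title= section, skip */

    const char *title_start = title + strlen("title=\"");
    const char *title_end = strchr(title_start, '"');  /* end of the title value */
    if (title_end == NULL)
        return;

    size_t name_len  = (size_t)(name_end - name_start);
    size_t title_len = (size_t)(title_end - title_start);

    if (name_len == title_len && strncmp(name_start, title_start, name_len) == 0)
        printf("%.*s\n", (int)name_len, name_start);   /* PageName and title match */
}

I imagine the while loop would then call print_if_match(line) on every line instead of puts(search), but I do not know whether this handles all the cases (for example, more than one link on the same line), so any guidance on a better approach would be great.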