I need to write a program that reads a Wikipedia source file and extracts all the links to other pages. All the links look like this example:
<a href="/wiki/PageName" title="PageName">Chicken</a>
I basically need to compare the PageName after /wiki/ with the title, and if they are the same (as above), display just the PageName on the terminal.
However, the following should not be matched since they are not in the same format as above:
<a href="http://chicken.com">Chicken</a>
(this is a link to a normal website off Wikipedia)
<a href="/wiki/Chicken">Chicken</a>
(this one is missing the title= section)
The output I am trying to achieve is just each matching PageName printed on its own line (so for the example above, just Chicken).
I have worked on this for quite a while and have been able to do the following:
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    FILE *file;
    char line[512];
    char *search;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <wikipedia source file>\n", argv[0]);
        return 1;
    }

    file = fopen(argv[1], "r");
    if (file == NULL) {
        perror(argv[1]);
        return 1;
    }

    /* fgets returns NULL at end of file; checking that instead of !feof() avoids reading the last line twice */
    while (fgets(line, sizeof line, file) != NULL) {
        search = strstr(line, "<a href=\"/wiki/");
        if (search != NULL) {
            puts(search); /* prints from the match to the end of the line */
        }
    }

    fclose(file);
    return 0;
}
This only gets me as far as the /wiki/ part of each link; I am blank on how to proceed from there. I have searched a lot but have not been able to get a lead. Help would be highly appreciated.
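The only direction I can think of is something along the lines of the sketch below, but I am not sure it is the right way to go about it. print_if_match is just a name I made up, the sketch is untested, it only uses the <stdio.h> and <string.h> headers already included above, and it assumes the href and title attributes sit on one line, in that order, with at most one link per line.

/* Sketch: if a line contains <a href="/wiki/PageName" title="Title">
   and PageName equals Title, print PageName. */
static void print_if_match(const char *line)
{
    const char *href = strstr(line, "<a href=\"/wiki/");
    if (href == NULL)
        return;

    const char *name_start = href + strlen("<a href=\"/wiki/");
    const char *name_end = strchr(name_start, '"');    /* end of the href value */
    if (name_end == NULL)
        return;

    const char *title = strstr(name_end, "title=\"");
    if (title == NULL)
        return;                                        /* no title= section, skip */

    const char *title_start = title + strlen("title=\"");
    const char *title_end = strchr(title_start, '"');  /* end of the title value */
    if (title_end == NULL)
        return;

    size_t name_len  = (size_t)(name_end - name_start);
    size_t title_len = (size_t)(title_end - title_start);

    if (name_len == title_len && strncmp(name_start, title_start, name_len) == 0)
        printf("%.*s\n", (int)name_len, name_start);   /* PageName and title match */
}

I imagine the while loop would then call print_if_match(line) on every line instead of puts(search), but I do not know whether this handles all the cases (for example, more than one link on the same line), so any guidance on a better approach would be great.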