I'm trying to program an API for Discord, and I need to retrieve two pieces of information from the HTML of the web page https://myanimelist.net/character/214 (and other similar pages with URLs of the form https://myanimelist.net/character/N for integers N): the URL of the character picture (in this case https://cdn.myanimelist.net/images/characters/14/54554.jpg) and the name of the character (in this case Youji Kudou). Afterwards I need to save those two pieces of information to JSON.
I am using HtmlAgilityPack for this, but I can't quite figure it out. The following is my first attempt:
    public static void Main()
    {
        var html = "https://myanimelist.net/character/214";
        HtmlWeb web = new HtmlWeb();
        var htmlDoc = web.Load(html);
        var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body");
        foreach (var node in htmlNodes.Descendants("tr/td/div/a/img"))
        {
            Console.WriteLine(node.InnerHtml);
        }
    }
Unfortunately, this produces no output. If I followed the path correctly (which is probably my first mistake), it should be "tr/td/div/a/img". The program runs without errors, but it prints nothing.
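One detail that may matter here: as far as I understand HtmlAgilityPack, Descendants(string) compares its argument against each element's name, so a slash-separated path such as "tr/td/div/a/img" is not interpreted as a path, whereas SelectNodes takes an XPath expression. Also, an img element has no inner content, so the URL would be in its src attribute rather than in InnerHtml. The following is only a minimal sketch on made-up inline markup (SampleHtml, PathDemo, CountByName and ImageSources are hypothetical names, and the markup is not the real myanimelist.net page), so the XPath that fits the actual site may well differ:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class PathDemo
{
    // Made-up markup for illustration only; NOT the real myanimelist.net page.
    public const string SampleHtml =
        "<html><body><table><tr><td><div>" +
        "<a href='#'><img src='https://example.com/pic.jpg' alt='Some Name'></a>" +
        "</div></td></tr></table></body></html>";

    // Counts descendants whose element name equals the given string.
    public static int CountByName(string html, string name)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.Descendants(name).Count();
    }

    // Returns the src attribute of every node matched by the XPath expression.
    public static string[] ImageSources(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodes = doc.DocumentNode.SelectNodes(xpath); // null when nothing matches
        if (nodes == null) return new string[0];
        return nodes.Select(n => n.GetAttributeValue("src", "")).ToArray();
    }

    public static void Main()
    {
        // A path passed as an element name matches no element.
        Console.WriteLine(CountByName(SampleHtml, "tr/td/div/a/img"));
        // A plain element name does match.
        Console.WriteLine(CountByName(SampleHtml, "img"));
        // The same path written as XPath, reading the src attribute.
        foreach (var src in ImageSources(SampleHtml, "//tr/td/div/a/img"))
            Console.WriteLine(src);
    }
}
```

The null check on SelectNodes is there because it returns null instead of an empty collection when nothing matches.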
My second attempt is:
    public static void Main()
    {
        var html = "https://myanimelist.net/character/214";
        HtmlWeb web = new HtmlWeb();
        var htmlDoc = web.Load(html);
        var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body");
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        var script = htmlDoc.DocumentNode.Descendants()
            .Where(n => n.Name == "tr/td/a/img")
            .First().InnerText;
        // Return the data of spect and stringify it into a proper JSON object
        var engine = new Jurassic.ScriptEngine();
        var result = engine.Evaluate("(function() { " + script + " return src; })()");
        var json = JSONObject.Stringify(engine, result);
        Console.WriteLine(json);
        Console.ReadKey();
    }
But this also doesn't work.
How can I extract the required information?
EDIT:
I've made some progress since then and found a way to get the link; it was rather simple. But now I'm stuck on finding the name of the character. The pages are structured the same way for every character (only the number at the end of the URL changes), so I want to process many of them in a for loop. Here's how I tried to do it:
    for (int i = 1; i <= 1000; i++)
    {
        HtmlWeb web = new HtmlWeb();
        var html = "https://myanimelist.net/character/" + i;
        var htmlDoc = web.Load(html);
        foreach (var item in htmlDoc.DocumentNode.SelectNodes("//*[@]"))
        {
            string n;
            n = item.GetAttributeValue("src", "");
            foreach (var item2 in htmlDoc.DocumentNode.SelectNodes("//*[@src and @alt='" + n + "']"))
            {
                Console.WriteLine(item2.GetAttributeValue("src", ""));
            }
        }
    }
In the first foreach loop I want to search for the name, which is always located at the same position on the page (e.g. http://prntscr.com/o1uo3c and http://prntscr.com/o1uo91, and more specifically http://prntscr.com/o1xzbk), but I haven't figured out how yet, since the HTML structure doesn't seem to offer anything I can follow up on. The second foreach loop searches for the picture URL, which works by now; n is supposed to hold the name, so that I can find the matching picture for each character.
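For the final step of saving the name and picture URL to JSON, the following is roughly what I have in mind. It is only a sketch under assumptions: it uses System.Text.Json (available in .NET Core 3.0 and later; another serializer would work just as well), and the CharacterInfo class, the CharacterJson helper and the character.json file name are placeholders of my own, not anything the site or HtmlAgilityPack prescribes.

```csharp
using System;
using System.IO;
using System.Text.Json;

// Hypothetical container for the two values scraped per character.
public class CharacterInfo
{
    public string Name { get; set; }
    public string ImageUrl { get; set; }
}

public static class CharacterJson
{
    // Serializes one character to an indented JSON string.
    public static string ToJson(CharacterInfo info)
    {
        var options = new JsonSerializerOptions { WriteIndented = true };
        return JsonSerializer.Serialize(info, options);
    }

    public static void Main()
    {
        // Values taken from the example page in the question.
        var info = new CharacterInfo
        {
            Name = "Youji Kudou",
            ImageUrl = "https://cdn.myanimelist.net/images/characters/14/54554.jpg"
        };

        string json = ToJson(info);
        File.WriteAllText("character.json", json); // placeholder file name
        Console.WriteLine(json);
    }
}
```

When looping over many characters, collecting the objects in a List<CharacterInfo> and serializing the list once would produce a single JSON array instead of one file per character.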