For a non-commercial, private school project I'm creating a piece of software that searches for lyrics based on which song is currently playing on Spotify. I have to do this in C# (requirement), but I can use other languages alongside it if I so desire.

I've found a few sites that I can fetch the lyrics from. I have already succeeded in fetching the entire HTML source, but after that I'm not sure what to do. I asked my teacher, and she told me to use XML (which I also found complicated :p), so I've read quite a bit about it and searched for examples, but I haven't found anything that seems applicable to my case.

Time for some code.

Let's say I wanted to fetch the lyrics from musixmatch.com:

(Human-readable altered) HTML:

<span data-reactid="199">
    <p class="mxm-lyrics__content" data-reactid="200">First line of the lyrics!
        These words will never be ignored
        I don't want a battle
    </p>
    <!-- react-empty: 201 -->
    <div data-reactid="202">
        <div class="inline_video_ad_container_container" data-reactid="203">
            <div id="inline_video_ad_container" data-reactid="204">
                <div class="" style="line-height:0;" data-reactid="205">
                    <div id="div_gpt_ad_outofpage_musixmatch_desktop_lyrics" data-reactid="206">
                        <script type="text/javascript">
                            //Really nice google ad JS which I have removed;
                        </script>
                    </div>
                </div>
            </div>
        </div>
        <p class="mxm-lyrics__content" data-reactid="207">But I got a war
            More fancy lyrics
            And lines
            That I want to fetch
            And display
            Tralala
            lala
            Trouble!
        </p>
    </div>
</span>

Note the first three lines of the lyrics are located at the top, with the rest in the bottom <p>. Also note that the two <p> tags have the same class. Full html source can be found here: view-source:https://www.musixmatch.com/lyrics/Bullet-for-My-Valentine/You-Want-a-Battle-Here%E2%80%99s-a-War At around line 97 the snippet starts.

So in this specific example there are the lyrics, and there is quite a bit of code that I don't need. So far I've tried fetching the html code with the following C#:

string source = "https://www.musixmatch.com/lyrics/Bullet-for-My-Valentine/You-Want-a-Battle-Here’s-a-War";

    // The HtmlWeb class is a utility class to get the HTML over HTTP
    HtmlWeb htmlWeb = new HtmlWeb();

    // Creates an HtmlDocument object from an URL
    HtmlAgilityPack.HtmlDocument document = htmlWeb.Load(source);

    // Targets a specific node
    HtmlNode someNode = document.GetElementbyId("mxm - lyrics__content");

    if (someNode != null)
    {
        Console.WriteLine(someNode);
    } else
    {
        Console.WriteLine("Nope");
    }

    foreach (var node in document.DocumentNode.SelectNodes("//span/div[@id='site']/p[@class='mxm-lyrics__content']"))
    {
        // here is your text: node.InnerText    "//div[@class='sideInfoPlayer']/span[@class='wrap']"
        Console.WriteLine(node.InnerText);
    }

    Console.ReadKey();

Fetching the entire HTML works, but the extraction doesn't. I'm stuck at extracting the lyrics from the HTML. Since the lyrics on this page aren't in an element with an id attribute, I can't just use GetElementbyId. Can somebody point me in the right direction? I want to support multiple sites, so I'll have to do this a few times for different sites.

MagicLegend
  • Maybe it makes sense to use their API? It's free for 2K requests per day https://developer.musixmatch.com/mmplans. (JFYI) – Artiom Nov 30 '16 at 10:46
  • `mxm-lyrics__content` is the **class** of the element and not the Id, which is why `GetElementbyId` doesn't find it. You could use the [technique in this question](http://stackoverflow.com/questions/13771083/html-agility-pack-get-all-elements-by-class) to get it by class. – stuartd Nov 30 '16 at 10:49
  • @Artiom Well, it's indeed free, but it doesn't include full lyrics I believe? Given the fancy cross at 'Full Lyrics Display'? – MagicLegend Nov 30 '16 at 10:51
  • @stuartd I'll have a read. Haven't found that one yet :-) – MagicLegend Nov 30 '16 at 10:51
  • You shouldn't parse the HTML. Use their API like @Artiom mentioned. This will normally give you XML or JSON, which you can easily read with the common C# APIs or Newtonsoft.JSON. The advantage of XML or JSON is that you can easily convert the data stream into plain C# objects. – Sebi Nov 30 '16 at 10:52
  • @MagicLegend I've missed that. – Artiom Nov 30 '16 at 10:52
  • @Sebi You're right. But, as I mentioned in my edited comment; the API doesn't provide full lyrics for free I believe... – MagicLegend Nov 30 '16 at 10:54
  • @MagicLegend ok damn... The problem with HTML is that you don't really have a pattern to filter on. There is no defined point you can look for to read the lyrics. Furthermore, if you do find a pattern and use it, you are bound to the frontend of their website. If they change their HTML, your program stops working. You would also need to implement each site separately, because the HTML will differ. – Sebi Nov 30 '16 at 11:00
  • @Sebi Yup. That's the thing I've been working on for the past two weeks :). I don't expect the sites themselves to change too much, and otherwise I hope I've gathered enough knowledge to fix the new issues myself. For me it isn't really a problem to rely on the frontend patterns. There is a pattern in this example, the `class=mxm-lyrics__content`... There has to be a way to get the content from those `<p>` tags? – MagicLegend Nov 30 '16 at 11:04
  • @Sebi well, APIs change too, but not so frequently of course :) – Artiom Nov 30 '16 at 11:14
  • @Artiom True, with freeware there is possibly no big difference. Otherwise, API changes are normally communicated to the users, or they become new versions while the old ones still work, so you have a better chance to react. But you are right, in a free school project this isn't a problem. – Sebi Nov 30 '16 at 11:20

1 Answer

One possible solution (use either of the two selections, not both):

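// Requires: using System; using System.Linq; using HtmlAgilityPack;
var htmlWeb = new HtmlWeb();
var documentNode = htmlWeb.Load(source).DocumentNode;

// Option 1: LINQ over every <p> whose class attribute contains the lyrics class
var findclasses = documentNode.Descendants("p")
    .Where(d => d.Attributes["class"]?.Value.Contains("mxm-lyrics__content") == true);

// Option 2: the same selection with XPath (SelectNodes returns null when nothing matches)
// var findclasses = documentNode.SelectNodes("//p[contains(@class,'mxm-lyrics__content')]");

// Join the text of all matching <p> elements, one block per line
var text = string.Join(Environment.NewLine, findclasses.Select(x => x.InnerText));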
Artiom
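For anyone who wants to try the class-based selection without hitting the site, here is a minimal sketch that parses a shortened copy of the snippet from the question out of a string. It assumes the HtmlAgilityPack NuGet package is referenced. `LyricsSketch` and `ExtractByClass` are hypothetical names made up for illustration, and the trimming and entity decoding are additions that are not part of the answer above.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class LyricsSketch
{
    // Hypothetical helper (not from the thread): returns the joined text of
    // every <p> whose class attribute contains the given class name.
    public static string ExtractByClass(string html, string className)
    {
        var document = new HtmlAgilityPack.HtmlDocument();
        document.LoadHtml(html); // parse from a string, so no network access is needed

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = document.DocumentNode
            .SelectNodes($"//p[contains(@class,'{className}')]");
        if (nodes == null)
        {
            return string.Empty;
        }

        // Decode HTML entities and trim the whitespace around each block.
        return string.Join(Environment.NewLine,
            nodes.Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()));
    }

    static void Main()
    {
        // Shortened stand-in for the musixmatch markup shown in the question.
        const string html =
            "<span><p class=\"mxm-lyrics__content\">First line of the lyrics!</p>" +
            "<div><p class=\"mxm-lyrics__content\">But I got a war</p></div></span>";

        Console.WriteLine(ExtractByClass(html, "mxm-lyrics__content"));
    }
}
```

Passing the class name as a parameter is one possible way to reuse the same helper for other lyrics sites, each with its own selector; whether a single class name is enough for a given site is something you would have to check against that site's markup.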
  • Nice solution. Thought about Regex first, but this is far better. – Sebi Nov 30 '16 at 11:24
  • Thank you! Works like a charm. Do you have some documentation (how is a notation like that even called?) on the magic that you execute with the first `findclasses` var? How do you build something like that? – MagicLegend Nov 30 '16 at 11:25
  • @Sebi Regex is considered as not the best solution for parsing HTML. Check this answer http://stackoverflow.com/a/1732454/797249. It's epic – Artiom Nov 30 '16 at 11:25
  • @MagicLegend Search for Linq ;) – Sebi Nov 30 '16 at 11:26
  • @MagicLegend `var` means the type is implicit. If you declare a string variable like `string value=""`, you know (and the compiler knows) it's a string, so there is no need to write the type explicitly. It also forces you to name variables in a better way. – Artiom Nov 30 '16 at 11:28
  • @Artiom Amazing answer :p A lot of these questions get solved by regex tho, I've tried using it myself for this problem... Why is that then? Since it's apparently not the best way to do so? – MagicLegend Nov 30 '16 at 11:28
  • @Artiom I was after the Linq apparently! But when you know it's going to be a string, why still use `var` for the declaration? Is there a downside to that? – MagicLegend Nov 30 '16 at 11:30
  • @MagicLegend no downsides (AFAIK). No need to write out the full type name because it's (mostly) obvious from the right-hand side of the assignment. If a method returns some value, it should be clear from its name what it does. (see Clean Code) – Artiom Nov 30 '16 at 11:33
  • @MagicLegend Using var is opinion-based, but in many cases it makes your code cleaner. Think of instantiating a Dictionary: Dictionary> dic = new Dictionary> versus var dic = new Dictionary> – Sebi Nov 30 '16 at 11:35
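Since the thread asks what the notation in the first `findclasses` line is called: it is a LINQ `Where` call with a lambda expression, combined with the null-conditional operator `?.`. The sketch below shows the same pattern using only the standard library. `FakeNode` and `JoinMatching` are hypothetical stand-ins invented for illustration, not HtmlAgilityPack types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LinqSketch
{
    // Hypothetical stand-in for an HTML node: a class attribute (may be null) and its text.
    public sealed class FakeNode
    {
        public string CssClass { get; set; }
        public string Text { get; set; }
    }

    public static string JoinMatching(IEnumerable<FakeNode> nodes, string className)
    {
        // Where keeps the elements for which the lambda returns true.
        // CssClass?.Contains(...) evaluates to null when CssClass is null,
        // and null == true is false, so nodes without a class are skipped.
        var matches = nodes.Where(n => n.CssClass?.Contains(className) == true);

        // Select projects each remaining node to its text.
        return string.Join(Environment.NewLine, matches.Select(n => n.Text));
    }

    static void Main()
    {
        var nodes = new List<FakeNode>
        {
            new FakeNode { CssClass = "mxm-lyrics__content", Text = "First line" },
            new FakeNode { CssClass = null, Text = "an ad" },
            new FakeNode { CssClass = "mxm-lyrics__content", Text = "Second line" },
        };

        Console.WriteLine(JoinMatching(nodes, "mxm-lyrics__content"));
    }
}
```

The `== true` at the end is what turns the nullable result of `?.` back into a plain `bool`, which is why the answer's lambda does not throw on `<p>` elements that have no class attribute.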