So I fetch a string from a website via code from another question I posted here. This works really well when I put it into a rich text box, but now I need to split the string into separate sentences in a list/array (I suppose a list will be easier, since you don't need to know in advance how long the input is going to be).

Yesterday I found the following code at another question (didn't note the question, sorry):

List<string> list = new List<string>(Regex.Split(lyrics, Environment.NewLine));

But the input is now splitting into only two parts: the first three sentences and the rest.

I retrieve the text from musixmatch.com with the following code (added fixed url for simplicity):

var source = "https://www.musixmatch.com/lyrics/Krewella/Alive";
var htmlWeb = new HtmlWeb();
var documentNode = htmlWeb.Load(source).DocumentNode;

var findclasses = documentNode
    .Descendants("p")
    .Where(d => d.Attributes["class"]?.Value.Contains("mxm-lyrics__content") == true);

var text = string.Join(Environment.NewLine, findclasses.Select(x => x.InnerText));

More information about this code can be found here. In a nutshell, it retrieves the specific HTML that contains the lyrics. I need to split the lyrics line by line for a synchronization process that I'm building (just like the one that was built into Spotify a while ago). I need something (preferably a list/array) that I can index, because that would make the database that stores all this data a bit smaller. What am I supposed to use for this process?

Edit: Answer to the mark of a possible duplicate: C# Splitting retrieved string to list/array

MagicLegend
  • You say "into separate sentences", but also "line by line". Which is it? Sentences are usually ended by one of `.`, `!`, or `?` (but you would need to watch out for numbers or embedded quotes that may have additional punctuation). – crashmstr Dec 15 '16 at 12:58
  • I've just looked at the sample, it's gonna be tough as they seem to split it into almost random tags in the HTML. – nik0lai Dec 15 '16 at 12:59
  • @crashmstr I'm sorry! You are correct! It's line by line. The lyrics almost never have a dot to end a line :) – MagicLegend Dec 15 '16 at 12:59
  • The text almost certainly *doesn't* use `\r\n`, which is what `Environment.NewLine` evaluates to on Windows. Try splitting by `\n` or use a StreamReader/StringReader and `ReadLine()` to let the class detect the proper character – Panagiotis Kanavos Dec 15 '16 at 12:59
  • Possible duplicate of [Split text into sentences in C#](http://stackoverflow.com/questions/4957226/split-text-into-sentences-in-c-sharp) – Simon Karlsson Dec 15 '16 at 13:00
  • How about splitting it by capital letter? As every line seems to start with one? Not perfect but unless you know it's gonna be the same new line character you will struggle. – nik0lai Dec 15 '16 at 13:01
  • @nik0lias It isn't random tho. If you take a look at the linked question you'll find that it's a specific class that has the lyrics. – MagicLegend Dec 15 '16 at 13:01
  • @SimonKarlsson That question relies on characters at the end of each sentence, which isn't the case with most lyrics. I can't use that code since it relies on those characters to split the sentences. – MagicLegend Dec 15 '16 at 13:02
  • @MagicLegend The problem is that there are two `p` tags with the lyrics and you just concatenate them by a new line, so when you split it there are only two parts. You need to split on whatever is breaking the actual lines. – juharr Dec 15 '16 at 13:03
  • @MagicLegend the example has lyrics split into two divs, with the same `mxm-lyrics__content` – nik0lai Dec 15 '16 at 13:03
  • That's indeed something that I hadn't realized myself. You gentlemen are correct in that. – MagicLegend Dec 15 '16 at 13:04
  • @MagicLegend your text uses `\n`, not `\r\n`. On Windows, `Environment.NewLine` is `\r\n`. Split by `\n` – Panagiotis Kanavos Dec 15 '16 at 13:05
  • @PanagiotisKanavos That works! If you could post that as an answer I can mark it :-) Thank you all for thinking with me! – MagicLegend Dec 15 '16 at 13:08
  • Just `Debug.Print(text.Replace("\r", @"\r").Replace("\n", @"\n"))` to see what the characters are – Slai Dec 15 '16 at 13:08
  • @PanagiotisKanavos Friendly reminder: Can you post the comment as an answer so I can mark it? :-) – MagicLegend Dec 15 '16 at 13:31
  • @MagicLegend Slai gave a better answer – Panagiotis Kanavos Dec 15 '16 at 13:38
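The `StringReader`/`ReadLine()` suggestion from the comments can be sketched roughly as follows. This is a minimal illustration, not code from the thread: the class name `LyricLines` and the method name `SplitWithReader` are made up for this example. It relies on `StringReader.ReadLine()` treating `\n`, `\r` and `\r\n` each as one line terminator, so the caller does not need to know which one the source page uses.

```csharp
using System.Collections.Generic;
using System.IO;

public static class LyricLines
{
    // Hypothetical helper: reads the text line by line.
    // StringReader.ReadLine() accepts "\n", "\r" and "\r\n" as line endings
    // and returns null once the end of the text is reached.
    public static List<string> SplitWithReader(string text)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line); // blank lines are kept as empty strings
            }
        }
        return lines;
    }
}
```

Calling `LyricLines.SplitWithReader(text)` with the `text` variable from the question would give an indexable `List<string>` with one entry per line.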

2 Answers

4

You can split by both:

var lines = lyrics.Split(new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
Slai
  • Sorry, forgot to add a comment here. I tweaked the line a little so it would insert the output into a list, so it became `List<string> list = new List<string>(lyrics.Split(new char[] { '\r', '\n' }));`. I also needed the empty entries, hence I removed that option. Thank you for the help :) – MagicLegend Dec 15 '16 at 17:48
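One caveat when keeping empty entries: splitting on the individual characters `'\r'` and `'\n'` turns every `\r\n` pair into an extra empty string. For text that only contains `\n` (as the comments report for this page) that makes no difference, but a hedged alternative is to split on string separators with `"\r\n"` listed first. The names `LyricSplitter` and `SplitKeepingBlankLines` below are hypothetical, chosen only for this sketch.

```csharp
using System;
using System.Collections.Generic;

public static class LyricSplitter
{
    // Separators are tried in array order at each position, so "\r\n" is
    // listed first and is consumed as one line break rather than two.
    private static readonly string[] LineBreaks = { "\r\n", "\n", "\r" };

    // Hypothetical helper: splits into lines and keeps genuinely blank lines.
    public static List<string> SplitKeepingBlankLines(string lyrics)
    {
        return new List<string>(lyrics.Split(LineBreaks, StringSplitOptions.None));
    }
}
```

With this variant a blank line in the lyrics still shows up as one empty entry, while a Windows-style `\r\n` line ending does not add a spurious one.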
0

What I would do is to ensure that there is a common concept of "NewLine" in the code. It could be \r, \n or \r\n. Simply replace all '\n' with "". (Edited this one)

Now, all you have to do is

var lyricLines = lyricsWithCommonNewLine.Split('\r');
K Ekegren
  • Why `\r` when it's `\n` that's guaranteed to appear in all cases? Removing `\r` makes sense, replacing `\n` with `\r` though, not so much – Panagiotis Kanavos Dec 15 '16 at 13:03
  • @PanagiotisKanavos What would you suggest then? – MagicLegend Dec 15 '16 at 13:04
  • I did, both here and in the comments. Just split by `\n`. In general, you can use `Replace("\r", "")`. Or use a Regex that matches both cases. Or use a StringReader. The regex and StringReader will be the fastest – Panagiotis Kanavos Dec 15 '16 at 13:07
  • `Environment.NewLine` will always return adequate new line symbol. – mrogal.ski Dec 15 '16 at 13:09
  • @PanagiotisKanavos, you are right, that was just a brain fart. Edited the answer. – K Ekegren Dec 15 '16 at 13:12
  • @m.rogalski no, it won't if by adequate you mean "correct". It will return the *current environment's* newline. That page though uses `\n` only – Panagiotis Kanavos Dec 15 '16 at 13:12
  • @KEkegren why do you insist on Mac's newline ? – Panagiotis Kanavos Dec 15 '16 at 13:13
  • @m.rogalski `Environment.NewLine` does not always return the adequate new line symbol. It's just the preferred new line for your current environment, but that string with lyrics does not come from your environment. – K Ekegren Dec 15 '16 at 13:14
  • @PanagiotisKanavos That was just arbitrary. It could have been \n as well. Or even "MySpecialNewLineReplacement". In this case, that won't go into the final result anyway – K Ekegren Dec 15 '16 at 13:15
  • [read this](https://msdn.microsoft.com/en-us/library/system.environment.newline(v=vs.110).aspx#Anchor_1). Content of `NewLine` depends on platform and .NET implementation. – mrogal.ski Dec 15 '16 at 13:16
  • @m.rogalski yes, you should read that link. It depends on *platform* and the source of that text is a *different* platform. Which is why the OP's code failed. That's why it's inappropriate in this case – Panagiotis Kanavos Dec 15 '16 at 13:19
  • With the answer now edited to take @PanagiotisKanavos's very correct point into account, perhaps an upvote would be in order? ;) – K Ekegren Dec 15 '16 at 13:22
  • Well, the `\r` doesn't work, but the suggested `\n` does :) As of right now your answer doesn't fully work for me, hence I didn't mark it yet ;) – MagicLegend Dec 15 '16 at 13:24
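The "Regex that matches both cases" mentioned in the comments could look roughly like the sketch below, which adapts the question's `Regex.Split` call so the pattern no longer depends on `Environment.NewLine`. The class and method names (`LyricRegex`, `SplitOnAnyNewLine`) are assumptions made for this example, not part of any answer above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LyricRegex
{
    // Matches a Windows ("\r\n"), old Mac ("\r") or Unix ("\n") line break.
    // "\r\n" comes first in the alternation so it is matched as one break.
    private static readonly Regex AnyNewLine = new Regex(@"\r\n|\r|\n");

    // Hypothetical helper: splits on whichever newline style the text uses.
    public static List<string> SplitOnAnyNewLine(string lyrics)
    {
        return new List<string>(AnyNewLine.Split(lyrics));
    }
}
```

Blank lines are preserved as empty strings here, which matches the behaviour the asker said was needed.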