I'm scraping lyrics from a lyrics site with Jsoup, and I want the output formatted the way the site shows it. Right now the whole song comes out as one string on a single line, like the output below. What I want to do is break the string before each capital letter, so it matches the line layout on the site.

I was told a million times Of all the troubles in my way How I had to keep on trying Little better ev'ry day But if I crossed a million rivers And I rode a million miles Then I'd still be where I started Bread and butter for a smile Well I sold a million mirrors In a shop in Alley Way But I never saw my face In any window any day Well they say your folks are telling you To be a super star But I tell you just be satisfied To stay right where you are Keep yourself alive keep yourself alive It'll take you all your time and a money Honey you'll survive Well I've loved a million women In a belladonic haze And I ate a million dinners Brought to me on silver trays Give me ev'rything I need To feed my body and my soul And I'll grow a little bigger Maybe that can be my goal I was told a million times Of all the people in my way How I had to keep on trying And get better ev'ry day But if I crossed a million rivers And I rode a million miles Then I'd still be where I started Still be where I started Keep yourself alive keep yourself alive It'll take you all your time and money honey You'll survive Keep yourself alive Keep yourself alive It'll take you all your time and money To keep me satisfied Do you think you're better ev'ry day No I just think I'm two steps nearer to my grave Keep yourself alive Keep yourself alive mm You take your time and take your money Keep yourself alive Keep yourself alive Keep yourself alive All you people keep yourself alive Keep yourself alive Keep yourself alive It'll take you all your time and a money To keep me satisfied Keep yourself alive Keep yourself alive All you people keep yourself alive Take you all your time and money honey You will survive Keep you satisfied Keep you satisfied

This is the format I want: http://prntscr.com/4rt1cf

My code so far is this:

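import java.io.IOException;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public static void lyricScrape() throws IOException {
    Scanner search = new Scanner(System.in);

    // Lower-case the artist name and strip spaces so it can go into the URL.
    String artist = search.nextLine().toLowerCase().replaceAll(" ", "");
    System.out.println("Artist saved");

    // Same treatment for the song title.
    String song = search.nextLine().toLowerCase().replaceAll(" ", "");
    System.out.println("Song saved");

    Document doc = Jsoup.connect("http://www.azlyrics.com/lyrics/" + artist + "/" + song + ".html").get();

    // Select the div(s) whose style attribute starts with "margin".
    Elements element = doc.select("div[style^=margin]");

    // text() normalizes whitespace, so the lyrics come back as a single line.
    String lyrics = element.text();
    System.out.println(lyrics);
}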
Pshemo

2 Answers

String.split takes a regex. The regex for a capital letter is "[A-Z]", but splitting on it would consume the letter. To keep the letter, match the space in front of it ("\\ [A-Z]") and put the letter inside a lookahead so that it is not part of the match:

String[] lines = lyrics.split("\\ (?=[A-Z])");               // one array element per line
String formatted = lyrics.replaceAll("\\ (?=[A-Z])", "\n");  // or: one string with line breaks

To avoid breaking before the one-letter word I, add a negative lookahead that rejects an I followed by whitespace:

String[] lines = lyrics.split("\\ (?!I\\s)(?=[A-Z])");
String formatted = lyrics.replaceAll("\\ (?!I\\s)(?=[A-Z])", "\n");
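
For anyone who wants to try the pattern without hitting the site, here is a minimal self-contained sketch. The class name LyricLineSplitter and the method splitLines are made up for illustration, and the sample string is just two stretches of the lyrics from the question.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Jsoup or of the question's code.
final class LyricLineSplitter {

    // A space followed by an uppercase letter, unless that letter is a
    // standalone "I" (an I that is itself followed by whitespace).
    private static final String LINE_BREAK = "\\ (?!I\\s)(?=[A-Z])";

    // Splits the flattened lyrics at every matching space; the lookaheads
    // consume nothing, so each capital letter stays at the start of its line.
    static List<String> splitLines(String lyrics) {
        return Arrays.asList(lyrics.split(LINE_BREAK));
    }

    public static void main(String[] args) {
        String lyrics = "I was told a million times Of all the troubles in my way "
                + "But I never saw my face In any window any day";
        for (String line : splitLines(lyrics)) {
            System.out.println(line);
        }
    }
}
```

Keep in mind this is a heuristic. A line in the original that starts with "I " will be glued to the previous line, and a contraction like "I'm" in the middle of a line will still start a new one, because only an I followed by whitespace is excluded.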
AlexR

Answer based on How do I preserve line breaks when using jsoup to convert html to plain text?

How about adding some special text after each <br/> in your HTML? That way, when you call text(), line<br/>line becomes something like line[specialString]line instead of being merged, and you can then replace [specialString] with \n. I mean something like

element.select("br").append("@REPLACEME@");
String lyrics = element.text().replaceAll("\\s*@REPLACEME@\\s*", "\n");
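
To show why the marker survives the whitespace collapsing, here is a dependency-free sketch of the same idea. It does not use Jsoup at all: the regex tag stripping is only a crude stand-in for text(), the marker replaces each br instead of being appended to it, and the class name BrMarkerSketch and the method toPlainText are made up for illustration.

```java
// Hypothetical stand-alone illustration of the marker trick; real code
// should let Jsoup parse the HTML as in the snippet above.
final class BrMarkerSketch {

    // Any string that cannot occur in the lyrics themselves works here.
    private static final String MARKER = "@REPLACEME@";

    static String toPlainText(String html) {
        // 1. Swap every <br>, <br/> or <br /> for the marker.
        String marked = html.replaceAll("(?i)<br\\s*/?>", MARKER);

        // 2. Stand-in for text(): drop the remaining tags and collapse
        //    all runs of whitespace (including real newlines) to one space.
        String flat = marked.replaceAll("<[^>]+>", "")
                            .replaceAll("\\s+", " ")
                            .trim();

        // 3. Turn each marker, plus any whitespace around it, into a newline.
        return flat.replaceAll("\\s*" + MARKER + "\\s*", "\n");
    }

    public static void main(String[] args) {
        String html = "<div>I was told a million times<br>\n"
                + "Of all the troubles in my way<br />\n"
                + "<i>How I had to keep on trying</i></div>";
        System.out.println(toPlainText(html));
    }
}
```

The point is only the order of operations: the marker goes in before the text is flattened, so it is still there afterwards to be turned into a line break.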

You can also use the Jsoup.clean method on the HTML of your lyrics to remove all unwanted tags (such as <b>, <i>, or <!-- comments -->) except the ones you allow, in this case <br />. Then replace the br tag with either \n or "", depending on whether your HTML already has a line break after each <br/>. So your code can look like

String lyrics = Jsoup.clean(
                    element.html(), //html to clean
                    Whitelist.none().addTags("br")//allowed tags
                ).replace("<br /> ", "");
Pshemo