I have created a basic web scraper using Jsoup to extract movie info from IMDB. However, when I scrape the genre, I can't help but get output like this:

Action Adventure Fantasy 27 April 2011 (UK)

Is there a way of using substring() so that it drops the rest of the string once it hits a number? In this case, the number 27.

Thank you

Craig
  • After looking at the suggestions here, I see one flaw: what if the movie has digits in it, like "Movie 48"? You could use this logic: write a for loop starting at the end of the string and find the first two digits after the month, then use that as your index. Basically like Se Won Jang's code, but start at the end, not the beginning. Then use the substring method from 0 to the index. – Ducksauce88 Oct 20 '13 at 05:40
  • Woah, totally did not think of that! Thank you very much for the heads up @Ducksauce88! – Craig Oct 20 '13 at 08:26
  • Just realized something, actually. I'm dealing with genres here, not an actual movie title, so I should be safe. Thanks again, I'll keep an eye out for similar flaws like the one you described – Craig Oct 20 '13 at 08:57
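
The concern raised in the comments (digits that belong to the text itself, as in "Movie 48") can be sidestepped by cutting at the release date rather than at the first digit. The following is a minimal sketch, assuming the date always looks like "27 April 2011" (day, month name, four-digit year); the class name `GenreExtractor`, the method `stripReleaseDate`, and the pattern itself are illustrative assumptions, not something taken from the scraped page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GenreExtractor {
    // Assumed date shape: day, month name, four-digit year, e.g. "27 April 2011 (UK)".
    private static final Pattern RELEASE_DATE =
            Pattern.compile("\\s+\\d{1,2}\\s+\\p{L}+\\s+\\d{4}\\b.*$");

    // Returns the text before the release date, or the trimmed input if no date is found.
    static String stripReleaseDate(String text) {
        Matcher m = RELEASE_DATE.matcher(text);
        return m.find() ? text.substring(0, m.start()) : text.trim();
    }

    public static void main(String[] args) {
        System.out.println(stripReleaseDate("Action Adventure Fantasy 27 April 2011 (UK)"));
        System.out.println(stripReleaseDate("Movie 48 27 April 2011 (UK)"));
    }
}
```

Because the pattern requires a month name between the day and the year, a stray number such as the 48 in "Movie 48" is left alone. If the page ever formats the date differently, the pattern would need adjusting.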

3 Answers

Do you want to get everything before 27?

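String target = targetString;
int targetLength = target.length();
int index;

// Advance to the first digit; index ends at targetLength if there is none.
for (index = 0; index < targetLength; index++) {
    if (Character.isDigit(target.charAt(index))) {
        break;
    }
}

// The result keeps the space before the digit; call trim() to drop it.
return target.substring(0, index);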
Se Won Jang

You could use the split method to split the string at the first occurrence of a space followed by a digit.

String genreInfo = "Action Adventure Fantasy 27 April 2011 (UK)";
String[] tokens = genreInfo.split("\\s\\d");
String genres = tokens[0];
System.out.println(genres);
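
A possible variation on the same idea, sketched here under the assumption that only the text before the first number is wanted: a lookahead keeps the digit out of the delimiter, and a limit of 2 stops `split` after the first match. The class name `GenreSplit` and the method `genresOf` are hypothetical names used only for this example.

```java
class GenreSplit {
    // Returns the text before the first whitespace run that is followed by a digit.
    static String genresOf(String genreInfo) {
        // (?=\\d) is a lookahead: the digit is checked but not consumed.
        // The limit of 2 means the string is split at most once.
        return genreInfo.split("\\s+(?=\\d)", 2)[0];
    }

    public static void main(String[] args) {
        System.out.println(genresOf("Action Adventure Fantasy 27 April 2011 (UK)"));
        System.out.println(genresOf("Comedy Drama"));
    }
}
```

When the input contains no digit, `split` returns the whole string as the only element, so the method hands the input back unchanged.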
Bill the Lizard

A bad idea. IMDB seems to provide public APIs described here, so scraping is a poor approach.

Community
Kayaman