I am trying to get the URL of the first search result. So far, I have tried downloading the page's HTML with an InputStream inside an AsyncTask, then reading the resulting string and extracting the first URL with a Java regex.

String str = result;
String regex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
    System.out.println(matcher.group());
    Toast.makeText(getBaseContext(), matcher.group(), Toast.LENGTH_LONG).show();
}

My code works well for extracting the first URL from an HTML file, but I have noticed that there are no URLs in the HTML when I download it on an Android device. There must be a better way of doing this.

Dr cola
  • Is the code above related to your error? You say that when you download with Android (which, I guess, means you run your code on Android) you get no URLs in the file. Maybe share the code that does the download, as that seems to be the part that is not working. Also, what do you get instead of URLs? – drkblog Sep 20 '20 at 03:21
  • @drkblog Here is my code: https://stackoverflow.com/a/32964969/14303192. Basically, the HTML files on PC and Android look similar, except the Android one doesn't have any links. – Dr cola Sep 20 '20 at 04:12
  • Are you sure it is not the way you are looking at the file? Are you downloading the file from the phone to the PC to be sure? Unfortunately I can't test your code on Android. But it is really weird, since your code seems to use only standard classes. – drkblog Sep 20 '20 at 04:20
  • @drkblog On PC, I am saving the page as HTML and then viewing it in Notepad. On Android, I am using my own code (AsyncTask with InputStream), then emailing the result to myself and viewing the email on PC. – Dr cola Sep 20 '20 at 12:36

1 Answer

Instead of if (matcher.find()) { ... }, use while (matcher.find()) { ... }.

With if, find() is called only once, so you only ever get the first URL in the input. Any other URLs, including ones later on the same line, are ignored.

For example, when reading the page line by line:

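// Assumes reader is a BufferedReader over the downloaded HTML and
// pattern is the compiled URL regex from the question.
String line;
while ((line = reader.readLine()) != null) {
    Matcher matcher = pattern.matcher(line);
    // each call to find() advances to the next match on this line
    while (matcher.find()) {
        String url = matcher.group();
    }
}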

Your code, modified:

Pattern pattern = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");
Matcher matcher = pattern.matcher(result);
while (matcher.find()) {
    String url = matcher.group();
}
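
The loop above assigns each match to a local variable and then discards it. As a minimal sketch of how the matches could be collected, the hypothetical helper below (the UrlExtractor class and findAll method are illustrative names, not part of the original code) returns every match of the question's pattern in document order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class UrlExtractor {
    // Same pattern as in the question, compiled once and reused.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Returns every match in document order; an empty list if there are none.
    static List<String> findAll(String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://example.com/a\">A</a> "
                + "<a href=\"http://example.org/b?x=1\">B</a>";
        System.out.println(findAll(html));
    }
}
```

The first element of the returned list is the same URL the original if version would have found, so the caller can still take only the first one when that is all it needs.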

I'm guessing you're attempting to get the first result, though, in which case you're bound to see a lot of unrelated google.com URLs. I recommend using Jsoup instead: parsing XML/HTML with regex is strongly discouraged because it gets messy, and Jsoup takes care of all of that for you easily.

For example:

// Requires the Jsoup library (org.jsoup.Jsoup, org.jsoup.nodes.*, org.jsoup.select.Elements)
Document document = Jsoup.connect("https://www.google.com/search?q=query").get();

// all results are grouped into containers using the class "g" (group)
Elements groups = document.getElementsByClass("g");

// check if any results were found
if (groups.isEmpty()) {
    System.out.println("no results found!");
    return;
}

// get the first result
Element firstGroup = groups.first();

// get the href from the first link in the first result, if it has one
Element firstLink = firstGroup.getElementsByTag("a").first();
String href = (firstLink != null) ? firstLink.attr("href") : null;
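
Whichever route is used, the extracted links may still include the unrelated google.com URLs mentioned above. As a rough sketch of one way to skip them, the hypothetical helper below (ResultFilter, isExternal and firstExternal are illustrative names, not part of Jsoup) uses only java.net.URI. It assumes that "unrelated" means a relative link, a non-http(s) link, or a link whose host is google.com or one of its subdomains; other Google domains are not handled.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper; names are illustrative and not part of Jsoup.
final class ResultFilter {
    // True for absolute http(s) URLs whose host is not google.com or a subdomain of it.
    static boolean isExternal(String href) {
        if (href == null) {
            return false;
        }
        try {
            URI uri = new URI(href);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) {
                return false; // relative link, or no recognizable host
            }
            boolean http = scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https");
            String h = host.toLowerCase(Locale.ROOT);
            boolean google = h.equals("google.com") || h.endsWith(".google.com");
            return http && !google;
        } catch (URISyntaxException e) {
            return false; // malformed href
        }
    }

    // Returns the first href that passes isExternal, if any.
    static Optional<String> firstExternal(List<String> hrefs) {
        return hrefs.stream().filter(ResultFilter::isExternal).findFirst();
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "/search?q=query",
                "https://www.google.com/preferences",
                "https://example.com/page");
        System.out.println(firstExternal(hrefs).orElse("no external link found"));
    }
}
```

The list passed to firstExternal could come from the regex matches or from the href attributes read with Jsoup; the helper does not depend on either.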
David Buck
Link