
I am wondering if anyone knows of a good technique to get all text on a current webpage from a Java application.

I have tried two methods:

  1. OCR: this wasn't accurate enough for me to use, as only roughly 60% of the text came out correct. It also only captured the text visible in the screenshot, and I need all the text on the page.

  2. Robot class: the method I have now uses the Robot class to send Ctrl-A, Ctrl-C and then takes the text from the clipboard. In terms of getting the text, this method has proved useful. The only problem I have with it is that the user sees the highlighted text for a split second, something I don't want them to see.
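For reference, here is a minimal, self-contained sketch of the Robot/clipboard approach described in (2). The class name, the 200 ms delay, and the headless guard are illustrative choices, not part of the original question:

```java
import java.awt.GraphicsEnvironment;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;
import java.awt.event.KeyEvent;

public class ClipboardGrab {
    public static void main(String[] args) throws Exception {
        // Robot needs a real display; bail out cleanly on headless systems.
        if (GraphicsEnvironment.isHeadless()) {
            System.out.println("Headless environment; skipping Robot demo.");
            return;
        }
        Robot robot = new Robot();

        // Select all (Ctrl-A), then copy (Ctrl-C) in the currently focused window.
        robot.keyPress(KeyEvent.VK_CONTROL);
        robot.keyPress(KeyEvent.VK_A);
        robot.keyRelease(KeyEvent.VK_A);
        robot.keyPress(KeyEvent.VK_C);
        robot.keyRelease(KeyEvent.VK_C);
        robot.keyRelease(KeyEvent.VK_CONTROL);
        robot.delay(200); // give the browser time to populate the clipboard

        // Read whatever text landed on the system clipboard.
        Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();
        if (clipboard.isDataFlavorAvailable(DataFlavor.stringFlavor)) {
            String text = (String) clipboard.getData(DataFlavor.stringFlavor);
            System.out.println(text);
        }
    }
}
```

This reproduces the flicker problem described above, since the Ctrl-A selection is briefly visible to the user.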

This might sound to some like a form of spyware, but this is a final year university project: it's an anti-cyber-bullying/child-grooming program, and it will only store information when it detects foul play.

Can anyone think of a better way to get the text off the browser?

Many thanks


5 Answers

2

Get the URL and read the page with an HTTP client class, e.g., Apache Commons HttpClient's GetMethod.

For more information, read here: http://hc.apache.org/httpclient-3.x/tutorial.html

  • The problem with that is permissions, with the likes of social media sites. The sites I will be monitoring are social media sites. – user3253722 Jan 30 '14 at 15:49
  • 1
    You can authenticate on Social networks from Java code. Take a look at Spring Social http://projects.spring.io/spring-social/ – Andres Jan 30 '14 at 15:51
  • Thanks for the reply... I have looked into Facebook APIs etc., but using them would defeat the purpose of my program: you need to know the user's username and password. My program is for parents to monitor their child on social media sites, but if they knew the username and password there would be no need for my program lol. I don't want to record key strokes, because that would just be monitoring what the child says; I want to detect bullying from other people. – user3253722 Jan 30 '14 at 16:16
  • If somehow I got access to a browser's source code, without the user knowing, that would work great... – user3253722 Jan 30 '14 at 16:18
  • You can, with Chromium and Firefox. – Andres Jan 30 '14 at 16:42
  • Cheers for the reply - surely I would still need to know the user's username and password to see the text using the methods provided by Spring Social? – user3253722 Jan 30 '14 at 16:59
1

You can simply get all the HTML from the website using URLConnection or Apache's HttpClient. Here's a question explaining how to do that: Get html file Java

Of course, it will not give you text embedded in binaries (e.g., Flash files), images, etc. For those, only OCR will work.
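A minimal sketch of the URLConnection variant of this answer. To run without network access, it fetches a temporary local file through a `file:` URLConnection; for a live page you would pass `new URL("http://example.com")` instead (the class and method names here are illustrative):

```java
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;

public class FetchPage {
    // Read the entire response from a URLConnection into a String.
    static String fetch(URL url) throws Exception {
        URLConnection conn = url.openConnection();
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(conn.getInputStream(), "UTF-8")) {
            char[] buf = new char[2048];
            int n;
            while ((n = r.read(buf)) >= 0) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // A temporary local file stands in for a web page so the sketch runs offline.
        Path p = Files.createTempFile("page", ".html");
        Files.write(p, "<html><body>Hello</body></html>".getBytes("UTF-8"));
        System.out.println(fetch(p.toUri().toURL()));
    }
}
```

As the answer notes, this only returns the raw HTML; it does not render JavaScript-generated content, and sites behind a login will return the login page instead.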

1

You can try something like this

// Uses Apache Commons HttpClient 3.x (org.apache.commons.httpclient.*)
HttpClient client = new HttpClient();
GetMethod get = new GetMethod("http://ThePage.com");
client.executeMethod(get); // actually send the request; without this there is no response body
InputStream in = get.getResponseBodyAsStream();
String htmlText = readString(in);
get.releaseConnection();

static String readString(InputStream is) throws IOException {
   char[] buf = new char[2048];
   Reader r = new InputStreamReader(is, "UTF-8");
   StringBuilder s = new StringBuilder();
   while (true) {
      int n = r.read(buf);
      if (n < 0)
         break;
      s.append(buf, 0, n);
   }
   return s.toString();
}
Juan Rada
  • 3,513
  • 1
  • 26
  • 26
  • I don't think this will work for social media sites, will it? – user3253722 Jan 30 '14 at 15:51
  • You just need to figure out in which way the page loads the content that you want. For example, the news text can be retrieved using JSON. – Juan Rada Jan 30 '14 at 15:54
  • I need the contents of the page in view. With social sites this causes a problem because of permissions. I'm just wondering if there's a way to get into the currently running browser to take the source code of the active page. Not sure if it can be done, to be honest. My copy-to-clipboard technique works great as it does what I want, although there's a split second where the user sees the highlighted text. The technique is a bit dirty. – user3253722 Jan 30 '14 at 16:32
0

The most generic solution would be a traffic sniffer.

  • That's what my tutor suggested... though he provided no help on how to do this. I've only got like months to do this, and I'm pretty sure traffic sniffing is way out of my league. Cheers for the reply mate – user3253722 Jan 30 '14 at 17:03
  • That was meant to be 2 months. With traffic sniffing, can you read in the source code of a browser? – user3253722 Jan 30 '14 at 17:15
  • You can analyze every TCP packet on the computer. You can find those which contain HTML and analyze the text in them. I guess you can have problems with HTTPS - it's encrypted. – Wojtek Mlodzianowski Jan 31 '14 at 12:50
  • Any idea how I could go about that? Are there any Java libraries that could help? I probably won't be able to do it, as most social sites are HTTPS. – user3253722 Jan 31 '14 at 21:38
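The "find packets that contain HTML" step from the comment above can be sketched in plain Java. This assumes the capture itself is done by some library (e.g., pcap4j) that hands you reassembled TCP payload bytes; the class and method names here are illustrative, and real traffic would also need TCP stream reassembly and handling of chunked or gzip-encoded bodies:

```java
import java.nio.charset.StandardCharsets;

public class HttpPayloadCheck {
    // Given the raw bytes of a captured HTTP response, return the HTML body
    // if the response declares an HTML content type, otherwise null.
    static String extractHtml(byte[] payload) {
        // ISO-8859-1 maps each byte to one char, so offsets stay byte-accurate.
        String s = new String(payload, StandardCharsets.ISO_8859_1);
        int sep = s.indexOf("\r\n\r\n"); // blank line separates headers from body
        if (sep < 0) {
            return null;
        }
        String headers = s.substring(0, sep).toLowerCase();
        if (!headers.contains("content-type: text/html")) {
            return null; // not an HTML response; ignore it
        }
        return s.substring(sep + 4);
    }

    public static void main(String[] args) {
        byte[] fake = ("HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/html; charset=UTF-8\r\n"
                + "\r\n"
                + "<html><body>hi</body></html>").getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(extractHtml(fake));
    }
}
```

As the comments note, this only works for plain HTTP; HTTPS payloads are encrypted and cannot be inspected this way.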
-1

This is the utility class I created for this purpose. It has one method that throws checked exceptions and another that wraps them in unchecked (runtime) exceptions, and it can also verify that the retrieved source ends with an expected string.

   import  java.io.BufferedInputStream;
   import  java.io.IOException;
   import  java.io.InputStream;
   import  java.net.MalformedURLException;
   import  java.io.EOFException;
   import  java.net.URL;

/**
   <P>Append the source-code from a web-page into a <CODE>java.lang.Appendable</CODE>.</P>

   <P>Demo: {@code java AppendWebPageSource}</P>
 **/
public class AppendWebPageSource  {
   public static final void main(String[] igno_red)  {
      String sHtml = AppendWebPageSource.get("http://usatoday.com", null);
      System.out.println(sHtml);   

      //Alternative:
      AppendWebPageSource.append(System.out, "http://usatoday.com", null);
   }
   /**
      <P>Get the source-code from a web page, with runtime-errors only.</P>

      @return  {@link #append(Appendable, String, String) append}{@code ((new StringBuilder()), s_httpUrl, s_endingString)}
    **/
   public static final String get(String s_httpUrl, String s_endingString)  {
      return  append((new StringBuilder()), s_httpUrl, s_endingString).toString();
   }
   /**
      <P>Append the source-code from a web page, with runtime-errors only.</P>

      @return  {@link #appendX(Appendable, String, String) appendX}{@code (ap_bl, s_httpUrl, s_endingString)}
      @exception  RuntimeException  Whose {@link Throwable#getCause() getCause()} contains the original {@link java.io.IOException} or {@code java.net.MalformedURLException}.
    **/
   public static final Appendable append(Appendable ap_bl, String s_httpUrl, String s_endingString)  {
      try  {
         return  appendX(ap_bl, s_httpUrl, s_endingString);
      }  catch(IOException iox)  {
         throw  new RuntimeException(iox);
      }
   }
   /**
      <P>Append the source-code from a web-page into a <CODE>java.lang.Appendable</CODE>.</P>

      <P><I>I got this from {@code <A HREF="http://www.davidreilly.com/java/java_network_programming/">http://www.davidreilly.com/java/java_network_programming/</A>}, item 2.3.</I></P>

      @param  ap_bl  May not be {@code null}.
      @param  s_httpUrl  May not be {@code null}, and must be a valid url.
      @param  s_endingString  If non-{@code null}, the web-page's source-code must end with this. May not be empty.
      @see  #get(String, String)
      @see  #append(Appendable, String, String)
    **/
   public static final Appendable appendX(Appendable ap_bl, String s_httpUrl, String s_endingString)  throws MalformedURLException, IOException  {
      if(s_httpUrl == null  ||  s_httpUrl.length() == 0)  {
         throw  new IllegalArgumentException("s_httpUrl (\"" + s_httpUrl + "\") is null or empty.");
      }
      if(s_endingString != null  &&  s_endingString.length() == 0)  {
         throw  new IllegalArgumentException("s_endingString is non-null and empty.");
      }

      // Create a URL instance
      URL url = new URL(s_httpUrl);

      // Get an input stream for reading
      InputStream is = url.openStream();

      // Create a buffered input stream for efficiency
      BufferedInputStream bis = new BufferedInputStream(is);

      // The characters of the ending string, and the index of the next
      // character in it to match. toCharArray() is called once, before the
      // read loop, instead of once per character.
      char[] acEndStr = ((s_endingString != null) ? s_endingString.toCharArray() : null);
      int ixEndStr = 0;

      // Repeat until end of file
      while(true)  {
         int iChar = bis.read();

         // Check for EOF
         if (iChar == -1)  {
            break;
         }

         char c = (char)iChar;

         try  {
            ap_bl.append(c);
         }  catch(NullPointerException npx)  {
            throw  new NullPointerException("ap_bl");
         }

         if(acEndStr != null)  {
            if(c == acEndStr[ixEndStr])  {
               //The character just retrieved is equal to the
               //next character in the ending string.

               if(ixEndStr == (acEndStr.length - 1))  {
                  //The entire string has been found. Done.
                  return ap_bl;
               }

               ixEndStr++;
            }  else  {
               //Mismatch: restart, but check whether this character itself
               //begins a new occurrence of the ending string (resetting
               //straight to zero would miss that case).
               ixEndStr = ((c == acEndStr[0]) ? 1 : 0);
            }
         }
      }

      if(s_endingString != null)  {
         //Should have exited at the "return" above.
         throw  new EOFException("s_endingString \"" + s_endingString + "\" is non-null, and was not found at the end of the web-page's source-code.");
      }
      return  ap_bl;
   }
}
  • Very smart code you have done there. I'm jealous lol. Problem is I can't use it because of permissions on social media sites. I'm just wondering if there's a way to get into the browser and take the source code from the current webpage the user is on... a social media site... – user3253722 Jan 30 '14 at 16:27
  • This utility is for public sites only, that's true, although I'm sure it could be rejiggered. If you want to get into the browser, though, it seems you'll need to get into the actual browser code itself, for Chrome, Firefox, IE, Safari, etc. Either that, or as suggested by @void-tec, sniffing. Either way, pretty ambitious. – aliteralmind Jan 30 '14 at 17:35
  • Yeah, the packet sniffing would be good, though I'm not sure if I have the time to experiment with it to get it working. Thanks for the reply lad – user3253722 Jan 30 '14 at 18:27
  • See, with the sniffing technique, is it possible to get all the source code of a currently active webpage? – user3253722 Jan 30 '14 at 21:39
  • Hard to see the point. This doesn't do anything that HttpURLConnection doesn't already do. The site you got that from is Grade A drivel BTW. The 'ending string' has no point whatsoever. HTTP doesn't have or require an ending string. And where are the 'runtime and non-runtime versions', whatever that means? – user207421 Mar 28 '15 at 23:07
  • @EJP In the limited sense that I use it, it works. I'd like to see what it should be replaced with. I meant to say there's one method that throws checked exceptions, and another that wraps them in an unchecked exception. Wrong terms. – aliteralmind Mar 28 '15 at 23:30
  • @EJP The ending string has no point whatsoever? What are you talking about? It allows you to download the webpage until a certain snippet of HTML is detected. It has nothing to do with the HTTP protocol. It's a useful feature. – aliteralmind Mar 29 '15 at 00:33