
I'm working with Java regular expressions on the Android platform.

I'm trying to search this HTML for matches of a regular expression I defined.

Here's my code:

    public void mainaaForWWW(String websiteSource) {
        // Download the page; if this fails, websiteSource still holds the URL
        try {
            websiteSource = readDataFromWWW(websiteSource);
        } catch (IOException e1) {
            e1.printStackTrace();
        }

        // Collect every occurrence of the opening tag
        ArrayList<String> cinemaArray = new ArrayList<String>();
        Pattern sample = Pattern.compile("<div class=\"theatre\">");
        Matcher secuence = sample.matcher(websiteSource);
        try {
            while (secuence.find()) {
                cinemaArray.add(secuence.group());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

        // Copy the matches into the titleTableForWWW field
        titleTableForWWW = new String[cinemaArray.size()];
        for (int i = 0; i < titleTableForWWW.length; i++)
            titleTableForWWW[i] = cinemaArray.get(i);
    }

The problem is quite strange: when I debug the code, the String websiteSource looks fine (the whole HTML file is loaded), but the while loop runs only 4 times. Searching the HTML document manually, I found 11 matches. The regex is simplified here just to find out what's going on. Any ideas?
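One way to narrow this down is to count the literal occurrences with `String.indexOf` and compare that with what `Matcher.find()` reports. If the two counts agree, the pattern is fine and the string handed to the matcher simply contains fewer occurrences than the page does. The sketch below assumes a made-up class name `MatchCount` and two hypothetical helper methods; none of these names come from the code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchCount {

    // Counts non-overlapping matches of the literal text using the regex engine.
    static int countWithRegex(String text, String literal) {
        Matcher m = Pattern.compile(Pattern.quote(literal)).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Counts non-overlapping occurrences with plain string search, no regex involved.
    // Assumes literal is not empty.
    static int countWithIndexOf(String text, String literal) {
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(literal, from)) >= 0) {
            count++;
            from += literal.length();
        }
        return count;
    }

    public static void main(String[] args) {
        String tag = "<div class=\"theatre\">";
        String sample = tag + "A</div>\n" + tag + "B</div>\n" + tag + "C</div>";
        System.out.println(countWithRegex(sample, tag));   // prints 3
        System.out.println(countWithIndexOf(sample, tag)); // prints 3
    }
}
```

Running both counters on the downloaded `websiteSource` rather than on the sample string would show whether the input or the pattern is at fault.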

OK, my bad. I found a solution.

Here's the code that was responsible for reading the HTML source into a String:

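    public String readDataFromWWW(String UrlAdress) throws IOException {
        String line = null;  // note: "line += ..." on null prepends the text "null"
        URL url = new URL(UrlAdress);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream(), "ISO-8859-2"));
        // note: readLine() runs twice per iteration, so the line read in the
        // condition is discarded and only every second line is appended
        while (rd.readLine() != null) {
            line += rd.readLine();
        }

        System.out.println(line);

        return line;
    }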

I suspected that building the string this way was messing something up, so I replaced the method with this one:

    public String readDataFromWWW(String UrlAdress) throws IOException {
        String wyraz = "";

        try {
            String webPage = UrlAdress;
            URL url = new URL(webPage);
            URLConnection urlConnection = url.openConnection();
            InputStream is = urlConnection.getInputStream();
            InputStreamReader isr = new InputStreamReader(is, "ISO-8859-2");

            int numCharsRead;
            char[] charArray = new char[1024];
            StringBuffer sb = new StringBuffer();
            while ((numCharsRead = isr.read(charArray)) > 0) {
                sb.append(charArray, 0, numCharsRead);
            }
            wyraz = sb.toString();

        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

        return wyraz;
    }

And everything works fine! Thanks a lot for the clues and help. I think the problem was related to newlines while building the String, but I'm not quite sure.
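A likely explanation, going by the first version of `readDataFromWWW`: its loop calls `rd.readLine()` twice per iteration, once in the `while` condition and once in the body. The line read by the condition is thrown away, so only every second line of the page reaches the String, which would account for some matches going missing. Starting from `line = null` also prepends the text "null" to the result. Below is a minimal sketch of a line-based loop that keeps every line; the class name `ReadAllLines` and the helper `readAll` are made up for illustration, and a `StringReader` stands in for the network stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadAllLines {

    // Reads every line from the reader, calling readLine() exactly once per iteration.
    static String readAll(Reader source) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader rd = new BufferedReader(source)) {
            String line;
            while ((line = rd.readLine()) != null) {
                // readLine() strips the line terminator, so add one back
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<div class=\"theatre\">A</div>\n"
                    + "<div class=\"theatre\">B</div>\n"
                    + "<div class=\"theatre\">C</div>\n";
        // Every line survives, so the result equals the input
        System.out.println(readAll(new StringReader(html)).equals(html)); // prints true
    }
}
```

The char-buffer version above avoids the issue in a different way: it never splits the input into lines at all, so nothing can be skipped.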

Mariusz Chw
  • Parsing HTML with regex? Not unless you want to [summon Cthulhu](http://stackoverflow.com/a/1732454/1223693)! – tckmn Jan 27 '13 at 23:24
  • To be precise, it's parsing the String object. I verified the content of this String and it's OK. – Mariusz Chw Jan 27 '13 at 23:37
  • @MariuszChw: There is nothing wrong with your regex, or the HTML file (I read the HTML file as UTF-8). Check whether the transfer terminated halfway, or whether there is something wrong with the encoding. – nhahtdh Jan 27 '13 at 23:47
  • The length of the string matches the HTML, so there's no problem with the transfer. I tried both ISO-8859-2 and UTF-8 encodings and the result is the same: only 4 loops. I really have no idea what's wrong :) – Mariusz Chw Jan 27 '13 at 23:54
  • See http://stackoverflow.com/a/1732454/870248 – Paul Vargas Jan 27 '13 at 23:54
  • I'd try `Pattern sample = Pattern.compile(Pattern.quote("<div class=\"theatre\">"));` instead just to be safe. – Bernhard Barker Jan 27 '13 at 23:56
  • I tried this, and nothing new in this riddle :) – Mariusz Chw Jan 28 '13 at 00:01
  • Have you tried `System.out.println(websiteSource);` and checking whether something went wrong (and that you can see the 11 occurrences)? – Bernhard Barker Jan 28 '13 at 00:30
  • Yes, I tried. I also used the debugger to look inside the String value, and it's OK. – Mariusz Chw Jan 28 '13 at 00:40
  • I suggest using http://htmlcleaner.sourceforge.net for parsing HTML. – iMysak Jan 28 '13 at 01:02
  • In case the links to the Cthulhu answer are not clear, DO NOT PARSE HTML WITH REGEX. You may get your code to work today, but it will be very brittle. Use a real HTML parser. – Jim Garrison Jan 28 '13 at 01:07
  • I thought regex was the simplest way to get specific data out of HTML. OK, so now I know: it's bad practice to use regex with HTML. Any alternative ways? – Mariusz Chw Jan 28 '13 at 08:19
