
I am developing a spider that reads URLs from a text file, downloads each page, and writes the URL and the page content to another file, separated by a \t.

When I get the page, it may contain line feed characters, which should be removed. But I do not know the page encoding before I get the page.

I am currently using JSOUP because it handles the encoding problem for me. But I found that JSOUP parses the HTML to find the encoding, which makes it slow.

Is there an easy way to just remove the line feed characters from the string or byte array?

Will this code work with UTF-8 or GBK?

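    byte[] buffer = new byte[4096];
    String page = "";
    int n;

    while ((n = input.read(buffer)) != -1) {
        // Only scan the bytes actually read in this pass.
        for (int i = 0; i < n; i++) {
            if (buffer[i] == '\r' || buffer[i] == '\n') {
                buffer[i] = ' ';
            }
        }
        // Each chunk is decoded on its own, using the platform default charset.
        page += new String(buffer, 0, n);
    }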

I found that the code above does not work with UTF-8, because a character in an Asian language may take more than 8 or 16 bits, so when I convert the bytes to a String chunk by chunk, a character may be split between two chunks.
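
The split described above can be reproduced without any network code. The following is only a minimal sketch: the class `ChunkSplitDemo` and the method `decodeInChunks` are made-up names, and a chunk size of 2 is chosen so that the boundary lands inside a three-byte UTF-8 character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are not from the original post.
class ChunkSplitDemo {

    // Decodes the bytes in fixed-size chunks, creating one String per chunk,
    // which mirrors what a read-then-decode loop does.
    static String decodeInChunks(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "a\u4e2d"; // 'a' followed by one CJK character
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8); // 1 + 3 bytes

        String whole = new String(utf8, StandardCharsets.UTF_8);
        String chunked = decodeInChunks(utf8, 2); // boundary falls inside the CJK character

        System.out.println(whole.equals(text));         // true
        System.out.println(chunked.equals(text));       // false
        System.out.println(chunked.contains("\uFFFD")); // true: replacement character
    }
}
```

Decoding the complete byte array once, after all bytes have been collected, avoids the problem because no character is ever cut at a chunk boundary.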

The following code works fine for me:

    int responseCode = connection.getResponseCode();

    if (responseCode >= 200 && responseCode < 300) {
        InputStream input = connection.getInputStream();

        byte[] buffer = new byte[BUFFER_SIZE];
        byte[] urlBytes = (url + "\t").getBytes("ASCII");

        // Start the output line with "<url>\t".
        System.arraycopy(urlBytes, 0, buffer, 0, urlBytes.length);
        int t = 0, index = urlBytes.length;
        while ((t = input.read()) != -1) {
            // Grow the buffer by 50% when it is nearly full.
            if (index >= buffer.length - 1) {
                byte[] temp = new byte[buffer.length * 3 / 2];
                System.arraycopy(buffer, 0, temp, 0, buffer.length - 1);
                buffer = temp;
            }
            // Replace CR and LF bytes with a space.
            if (t == '\n' || t == '\r') {
                t = ' ';
            }
            buffer[index++] = (byte) t;
        }
        // Terminate the output line.
        buffer[index++] = '\n';
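
The manual buffer growth above can probably be delegated to `java.io.ByteArrayOutputStream`, which resizes itself. This is a sketch of the same idea, not a drop-in replacement: `LineBuilder` and `buildLine` are hypothetical names, and it still assumes the page encoding is a superset of ASCII.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; class and method names are illustrative only.
class LineBuilder {

    // Builds "<url>\t<page bytes>\n" with every CR or LF byte in the page
    // replaced by a space. Assumes an ASCII-compatible page encoding.
    static byte[] buildLine(String url, InputStream input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write((url + "\t").getBytes(StandardCharsets.US_ASCII));
        int t;
        while ((t = input.read()) != -1) {
            out.write((t == '\n' || t == '\r') ? ' ' : t);
        }
        out.write('\n');
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // A stand-in for the HTTP response body.
        byte[] page = "<p>\u4e2d\r\n</p>".getBytes(StandardCharsets.UTF_8);
        byte[] line = buildLine("http://example.com/", new ByteArrayInputStream(page));
        System.out.print(new String(line, StandardCharsets.UTF_8));
    }
}
```

With a real connection, wrapping the stream in a `BufferedInputStream` before passing it in would likely help, since `read()` is called once per byte.
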
deepdark
  • You can probably use a regex to replace all occurrences of \t with an empty string. – Jonas Czech Apr 09 '15 at 10:45
  • You do know the encoding before you get the page. It's in a response header. – user207421 Apr 10 '15 at 03:07
  • @EJP, yes, this is what JSOUP does. JSOUP tries to find the encoding in the header (the encoding may not be there); when that fails, it parses the HTML, looking for the encoding info in the HTML. But I find it very slow, and when I save the HTML from the JSOUP API, JSOUP gets the HTML from the DOM, which is not the original one. – deepdark Apr 10 '15 at 03:39
  • If you do not know the encoding, you cannot convert the byte stream into a character stream in which you can do a search for line breaks. Your example code will not work for UTF-16 encoded text. – Raedwald Apr 10 '15 at 06:59
  • possible duplicate of [what is character encoding](http://stackoverflow.com/questions/10611455/what-is-character-encoding) – Raedwald Apr 10 '15 at 07:04
  • Well, the last code I posted does not work for UTF-16; it is based on the assumed precondition that the page encoding is a superset of ASCII, such as UTF-8, GBK, or GB18030. I am not very familiar with UTF-16; can you point out where the problem is? Is the \n code in UTF-16 different from the one in ASCII? @Raedwald – deepdark Apr 10 '15 at 07:27
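
On the question of how \n looks in UTF-16 compared with ASCII, the bytes can be printed directly. This is a small sketch; `NewlineBytesDemo` and `hex` are made-up names, and the BOM-free `UTF_16BE` and `UTF_16LE` charsets are used so that only the character itself is shown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
class NewlineBytesDemo {

    // Formats the encoded bytes of s as space-separated hex pairs.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex("\n", StandardCharsets.US_ASCII)); // 0A
        System.out.println(hex("\n", StandardCharsets.UTF_8));    // 0A
        System.out.println(hex("\n", StandardCharsets.UTF_16BE)); // 00 0A
        System.out.println(hex("\n", StandardCharsets.UTF_16LE)); // 0A 00
    }
}
```

In UTF-16 the line feed is a two-byte unit, so a byte equal to 0x0A is only a line feed when it is paired with 0x00 in the right position; the same byte can also be half of an unrelated character.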

1 Answer


Depending on the operating system, new lines can be \n, \r\n, or sometimes \r. These are ASCII characters, so they are always encoded the same way if the encoding is a superset of ASCII. In that case, just remove all \r and \n in your pages.

However, this will not work for other encodings such as UTF-16.
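
A round trip through a byte-level replacement illustrates both cases. This is a sketch under stated assumptions: `CrLfReplaceDemo`, `replaceCrLf`, and `roundTrip` are hypothetical names, the bytes are replaced with a space rather than removed, and U+010A is picked as a test character only because its UTF-16 form happens to contain a 0x0A byte.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative, not from the answer.
class CrLfReplaceDemo {

    // Returns a copy of the input with every 0x0D or 0x0A byte set to 0x20.
    static byte[] replaceCrLf(byte[] in) {
        byte[] out = in.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] == '\r' || out[i] == '\n') {
                out[i] = ' ';
            }
        }
        return out;
    }

    // Encodes, patches the bytes, and decodes again with the same charset.
    static String roundTrip(String s, Charset cs) {
        return new String(replaceCrLf(s.getBytes(cs)), cs);
    }

    public static void main(String[] args) {
        // Two CJK characters, CR LF, then U+010A.
        String text = "\u4e2d\u6587\r\n\u010A";
        String expected = "\u4e2d\u6587  \u010A";

        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(expected));    // true
        System.out.println(roundTrip(text, StandardCharsets.UTF_16BE).equals(expected)); // false: U+010A became U+0120

        // GBK may be missing from some Java runtimes, so check before using it.
        if (Charset.isSupported("GBK")) {
            Charset gbk = Charset.forName("GBK");
            System.out.println(roundTrip("\u4e2d\u6587\r\n", gbk).equals("\u4e2d\u6587  "));
        }
    }
}
```

In UTF-8 and GBK, the bytes 0x0A and 0x0D never occur inside a multi-byte character, which is why the byte-level approach is safe there. In UTF-16 the same byte values can be one half of another character, so that character gets corrupted.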

WilQu
  • That will combine words which were separated only by a linebreak; better to replace CR LF, or best nonempty sequence of them, by a space. – dave_thompson_085 Apr 09 '15 at 18:07
  • They are *not* encoded the same way in all character encodings. In particular, the byte patterns for UTF-16 and UTF-8 are different. – Raedwald Apr 10 '15 at 07:01
  • @Raedwald I edited my answer. I would delete it, since I realize it doesn't really answer the question, but I can't because it’s accepted. – WilQu Apr 15 '15 at 09:49