I am developing a spider that reads URLs from a text file, downloads each page, and writes the URL and the page content to another file with a \t between them.
The downloaded page may contain line feed characters, which should be removed. But I do not know the page's encoding before I fetch it.
Right now I am using JSoup, because it handles the encoding detection for me. But I find that JSoup parses the HTML to detect the encoding, which makes it slow.
Is there an easy way to just remove the line feed characters from the string or byte array?
Will this code work with UTF-8 or GBK?
    byte[] buffer = new byte[4096];
    String page = "";
    int n;
    while ((n = input.read(buffer)) != -1) {
        for (int i = 0; i < n; i++) {  // only touch the bytes actually read
            if (buffer[i] == '\r' || buffer[i] == '\n') {
                buffer[i] = ' ';
            }
        }
        page += new String(buffer, 0, n);
    }
I found that the code above does not work with UTF-8, because a character in an Asian language may take more than one byte, so when I convert the bytes to a String chunk by chunk, a character may be split at a buffer boundary.
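One way around this, I think, is to never decode chunk by chunk at all: in both UTF-8 and GBK the bytes 0x0A (`'\n'`) and 0x0D (`'\r'`) never appear inside a multi-byte character, so it should be safe to replace them at the byte level and collect all bytes before decoding once at the end. A minimal sketch of that idea (the class and method names here are my own, not from the question):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class NewlineStripper {
    // Replace '\r' and '\n' with spaces at the byte level. Safe for UTF-8 and
    // GBK: in both encodings, 0x0A and 0x0D never occur inside a multi-byte
    // character, so no character can be corrupted by this replacement.
    static byte[] stripNewlines(InputStream input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = input.read(buffer)) != -1) {
            for (int i = 0; i < n; i++) {
                if (buffer[i] == '\r' || buffer[i] == '\n') {
                    buffer[i] = ' ';
                }
            }
            out.write(buffer, 0, n);  // keep raw bytes; decode once at the end
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "\u4e2d\u6587\r\nabc".getBytes("UTF-8");
        byte[] result = stripNewlines(new ByteArrayInputStream(data));
        System.out.println(new String(result, "UTF-8"));
    }
}
```

Because no decoding happens until the full page has been read, a multi-byte character can never be split at a buffer boundary.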
The following code works fine for me:

    int responseCode = connection.getResponseCode();
    if (responseCode >= 200 && responseCode < 300) {
        InputStream input = connection.getInputStream();
        byte[] buffer = new byte[BUFFER_SIZE];
        byte[] urlBytes = (url + "\t").getBytes("ASCII");
        System.arraycopy(urlBytes, 0, buffer, 0, urlBytes.length);
        int t = 0, index = urlBytes.length;
        while ((t = input.read()) != -1) {
            if (index >= buffer.length - 1) {  // keep one slot free for the trailing '\n'
                byte[] temp = new byte[buffer.length * 3 / 2];
                System.arraycopy(buffer, 0, temp, 0, buffer.length);  // copy the whole buffer
                buffer = temp;
            }
            if (t == '\n' || t == '\r') {
                t = ' ';
            }
            buffer[index++] = (byte) t;
        }
        buffer[index++] = '\n';
    }