12

I usually use the following method to detect whether a character is whitespace:

Character.isWhitespace(char ch);

Now I need to detect all the variants of line breaks (\n, \r, etc.) for all platforms (Linux, Windows, Mac OSX, etc.). Is there any similar way to detect if a character is a line break? If there is not, how can I detect all the possible variants?


Edit from comments: As I didn't know that line breaks can be represented by several characters, I've added some context to the question.

I'm implementing the write(char[] buffer, int offset, int length) method in a Writer (see Javadoc). In addition to other operations, I need to detect line breaks inside the buffer. I'm trying to avoid creating a String from the buffer to save memory, as I've seen that the buffer is sometimes very big (several MB).

Is there any way to detect line breaks without creating a String?

jeojavi
  • 1
    check if char is System.getProperty("line.separator") – Thusitha Thilina Dayaratne Sep 18 '14 at 14:13
  • 1
    What do you mean by "etc"? And bear in mind that on Windows, the normal separator is `"\r\n"`, so not a single character. What bigger problem are you trying to solve? If you're trying to break a string into lines, consider using `BufferedReader` wrapping a `StringReader` instead. – Jon Skeet Sep 18 '14 at 14:14
  • @JonSkeet didn't know that a line break could be represented by two characters, thanks for the advice – jeojavi Sep 18 '14 at 14:16
  • @ThusithaThilinaDayaratne thanks for the suggestion but I was looking for a method to detect a line break character by character – jeojavi Sep 18 '14 at 14:18
  • @ThusithaThilinaDayaratne, System.getProperty("line.separator") does not return a char, and the content can be a sequence (like \r\n). – Martin Sep 18 '14 at 14:20
  • 1
    @JaviFernández, this should be the way to go. However, you cannot test a line break in Java "character by character" because it is often a sequence. So you need to find the sequence in the String... – Martin Sep 18 '14 at 14:21
  • @JonSkeet In summary I'm trying to remove all whitespace repetitions that are not line breaks while writing into a `Writer`. For example, the output for `this is an\n example` would be `this is an\nexample`. I also make other complex operations. – jeojavi Sep 18 '14 at 14:24
  • @Martin so for example if in Windows I use only `\r` or `\n`, it would not be considered as a line break? – jeojavi Sep 18 '14 at 14:29
  • It definitely sounds like this would be simpler by splitting into lines first, using existing functionality. – Jon Skeet Sep 18 '14 at 14:31
  • @JonSkeet As I'm implementing a `Writer` my only source is an array of `char`s. What functionalities are you suggesting? – jeojavi Sep 18 '14 at 14:33
  • `input.indexOf(System.getProperty("line.separator"));` would do part of the trick. Windows generates `\r\n`, Unix/Linux `\n`, older MACs `\r`, so you will need to check where your input comes from. When parsing Strings or reading from Files, using the `BufferedReader` normally works fine... – Martin Sep 18 '14 at 14:34
  • @JaviFernández, where does the char buffer come from? Can you post a sample of that input code? – Martin Sep 18 '14 at 14:35
  • @Martin thanks for the suggestion but my input is not an `String`, it's an array of `char`s – jeojavi Sep 18 '14 at 14:36
  • @Martin `public void write(char[] buffer, int offset, int length) throws IOException { ...` (from http://docs.oracle.com/javase/7/docs/api/java/io/Writer.html) – jeojavi Sep 18 '14 at 14:37
  • @JaviFernández: `String myString = new String(buffer);` should do the trick... However, trying to parse something you want to use for a Writer sounds suspicious, there might be a much easier way to accomplish what you need... – Martin Sep 18 '14 at 14:51
  • @Martin it has to be a `Writer` because it's the only way to use it with a third party library. – jeojavi Sep 18 '14 at 14:57
  • That's perfectly fine, but The writer is where you will send your information, right? So, where does it come from? – Martin Sep 18 '14 at 15:04
  • @Martin it comes from an `InputStream` which contains a sequence of bytes from a PDF on the Internet. The third party library parses that PDF and offers the possibility of writing the content of the PDF into a `Writer` instead of returning a String. The resulting text contains lots of whitespace and I wanted to remove it before storing it into a database to save space. Some PDF documents can be very big (hundreds of MB) so if I create a new String, the memory needed would be double. That's why I went for the `Writer` approach. – jeojavi Sep 18 '14 at 15:15
  • OK, so you are willing to pass a custom Writer to the third party library that will "filter" what you need? – Martin Sep 18 '14 at 15:22
  • The third party library only extracts the text from the PDF, as the original document is only a sequence of bytes. The Writer is used to store the final text. – jeojavi Sep 18 '14 at 15:24
  • OK, but using the Writer is a problem in the first place as you will have a char array (buffer) that's potentially hundreds of MB big! Try rather to use Streams in both ways... If you can pass a custom OutputStream to the library, you could interfere there, but only if it is buffered... However, what does the third-party library return? – Martin Sep 18 '14 at 15:28
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/61504/discussion-between-martin-and-javi-fernandez). – Martin Sep 18 '14 at 15:29
  • "As I'm implementing a Writer" - that comment was the first time we had that piece of context. This is a classic XY question, and a great example of why it's worth including context in your question. I suspect there may well be better approaches... heck, even just dumping to disk and then reading from disk would quite possibly simplify things. – Jon Skeet Sep 18 '14 at 15:37
  • @JonSkeet as I didn't know about line break sequences I thought this question would be clear enough and the answers would go in a different direction – jeojavi Sep 18 '14 at 15:58
  • @JonSkeet I edited my question, I hope is clearer now. PS dumping to disk and reading from disk is not a possibility – jeojavi Sep 18 '14 at 16:14
  • Well somewhat... But you've just added another restriction in comments that isn't in the question... – Jon Skeet Sep 18 '14 at 16:18
  • @JonSkeet which one? – jeojavi Sep 18 '14 at 16:20
  • "dumping to disk and reading from disk is not a possibility" - no explanation, and no mention in the question. – Jon Skeet Sep 18 '14 at 17:34
  • @JonSkeet I really appreciate your comments but excuse me because I don't understand why you need so much information about my project. Isn't the question clear? I was only asking if a line break can be detected from a single character. – jeojavi Sep 18 '14 at 17:45
  • The problem is that the answer to that very specific question isn't necessarily the best answer to your actual task. I realize that it can be very difficult to know how much context to give sometimes. – Jon Skeet Sep 18 '14 at 18:15

3 Answers

12

Use regex to do the work for you:

if (!String.valueOf(character).matches("."))

Without the DOTALL switch, the dot matches all characters except line terminators, which according to the documentation include:

  • A newline (line feed) character ('\n'),
  • A carriage-return character followed immediately by a newline character ("\r\n"),
  • A standalone carriage-return character ('\r'),
  • A next-line character ('\u0085'),
  • A line-separator character ('\u2028'), or
  • A paragraph-separator character ('\u2029').

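Note that line break sequences exist, e.g. \r\n, but you asked about individual characters. The regex check also gives the expected result for the two-character input "\r\n"; bear in mind, though, that any other string longer than one character fails to match "." as well, so the check is only reliable for single-character inputs.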
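If building a one-character String and running the regex engine for every char is too heavy, the list above can also be written out as a plain comparison. This is only a minimal sketch: the names `LineBreakChars` and `isLineBreak` are made up for illustration, and it treats `\r` and `\n` as two separate characters, so the caller still has to decide how to handle a `\r\n` pair.

```java
final class LineBreakChars {
    private LineBreakChars() {}

    /** Returns true if c is one of the single-character line terminators listed above. */
    static boolean isLineBreak(char c) {
        switch (c) {
            case '\n':      // line feed
            case '\r':      // carriage return
            case '\u0085':  // next line (NEL)
            case '\u2028':  // line separator
            case '\u2029':  // paragraph separator
                return true;
            default:
                return false;
        }
    }
}
```

It can be called directly on the elements of a `char[]`, for example `LineBreakChars.isLineBreak(buffer[i])`.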

Bohemian
  • Thanks for your answer! There are no other line break characters? Does this work for Windows too? – jeojavi Sep 18 '14 at 14:19
  • @JaviFernández I dug deeper - there are others. See answer, and a more complete solution – Bohemian Sep 18 '14 at 14:37
  • @Martin the regex approach will work for sequences too eg `"\r\n"` – Bohemian Sep 18 '14 at 14:41
  • As @Martin comments this only will work for characters and not for sequences, but the list of all possible characters for a line break is one of the things I was looking for, thank you!! – jeojavi Sep 18 '14 at 14:44
  • @JaviFernández, depending on what you need to do, there are some answers that might help you out. Try to be more specific, for example: "Parse some input string without knowing where it comes from (as long as it's Unix/Linux/MAC/Windows)", "Tell where a file comes from based on line separators", etc. Telling us your goal rather than just a part of it might help people give more specific answers. – Martin Sep 18 '14 at 14:49
  • @Martin as I didn't know about line break sequences I thought this question would be clear enough and the answers would go in a different direction, sorry for the inconvenience. – jeojavi Sep 18 '14 at 14:54
  • @Bohemian, the problem would be that for `"\r\n"` you would find two line breaks instead of one, or am I missing something? – Martin Sep 18 '14 at 15:01
  • 1
    @JaviFernández, it's no inconvenience, just a suggestion to make your question as valuable as possible and to get the best possible answer. – Martin Sep 18 '14 at 15:03
1

As I posted in my comments, the line separator is not always a "character", but a sequence of characters, depending on the platform. To be independent it would look like this:

public String[] splitLines(String input) {
    return input.split("(\r\n|\r|\n)");
}

Based on this answer:

Match linebreaks - \n or \r\n?

Note that this means regex matching, not char matching. Getting a String out of the buffer should be achievable, though.
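If copying the buffer into a String is the concern, one possible variation is to run the regex over a `CharBuffer` view of the array, since `Pattern.matcher` accepts any `CharSequence`. The sketch below is an assumption about how that could look, not something tested against the asker's library: the class name `BufferLineBreaks` and the method `countLineBreaks` are hypothetical, and a `\r\n` pair that is split across two separate `write` calls is not handled.

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BufferLineBreaks {
    // "\r\n" comes first so that a Windows line ending is matched as one break, not two.
    private static final Pattern LINE_BREAK =
            Pattern.compile("\r\n|[\n\r\u0085\u2028\u2029]");

    private BufferLineBreaks() {}

    /** Counts line breaks in buffer[offset .. offset + length) without building a String. */
    static int countLineBreaks(char[] buffer, int offset, int length) {
        // CharBuffer.wrap creates a view over the array; the chars are not copied.
        Matcher matcher = LINE_BREAK.matcher(CharBuffer.wrap(buffer, offset, length));
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }
}
```

The same `Matcher` could be used with `start()` and `end()` to locate each break instead of only counting them.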

Martin
  • thank you very much for your answer @Martin, but creating a String for each buffer is one of the things I was trying to avoid. I will take this answer into account if I go for a different approach. – jeojavi Sep 18 '14 at 14:48
  • That's fine... Maybe you need to look where your char buffer comes from... If you will be using it in a writer, it has to be generated somewhere, I would assume it comes from a String... If you look at the String constructor for (`char[]`), it uses an array copy, so it's not so expensive: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/lang/String.java#String.%3Cinit%3E%28char[]%29. If you need to cover all the cases, you will have to do regex matching rather than character matching... – Martin Sep 18 '14 at 14:56
0

You can get the OS dependent line separator using

System.getProperty("line.separator")

This will return a string.

But since you are trying to use a char, checking whether the char is '\n' or '\r' is correct.

if(yourChar == '\r' || yourChar == '\n')
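Since `\r\n` should normally count as one line break, and the two characters may arrive in different `write` calls, the check above needs a little state. The following is a rough sketch under those assumptions; `LineBreakCounter`, `accept` and `count` are hypothetical names, and only `\r` and `\n` are considered.

```java
final class LineBreakCounter {
    private boolean lastWasCr;  // true if the previous char seen was '\r'
    private long count;

    /** Feeds one chunk, e.g. the arguments passed to Writer.write(char[], int, int). */
    void accept(char[] buffer, int offset, int length) {
        for (int i = offset; i < offset + length; i++) {
            char c = buffer[i];
            if (c == '\n') {
                // A '\n' right after '\r' completes a "\r\n" pair that was already counted.
                if (!lastWasCr) {
                    count++;
                }
                lastWasCr = false;
            } else if (c == '\r') {
                count++;
                lastWasCr = true;
            } else {
                lastWasCr = false;
            }
        }
    }

    /** Number of line breaks seen so far across all chunks. */
    long count() {
        return count;
    }
}
```

Because the flag survives between calls, a `\r` at the end of one buffer followed by a `\n` at the start of the next is still counted once, and no String is created at any point.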