How to prevent CR/LF?

Question

I am reading text from a PDF using PDFBox and apparently, at least on Windows, it represents each line break with a Unicode character such as &#10;.

My question is: how can I prevent this line-break character from being concatenated to the string in the code below?

tokenizer =new StringTokenizer(Text,"\\.");
while(tokenizer.hasMoreTokens())
{
    String x= tokenizer.nextToken();
    flag=0;
    for(final String s :x.split(" ")) {
       if(flag==1)
          break;
       if(Keyword.toLowerCase().equals(s.toLowerCase()) && !"".equals(s)) {
          sum+=x+"."; //here need first to check for "&#13;&#10"
                      // before concatenating the String "x" to String "sum"
          flag=1;
       }
   }
}

!"".equals(s) -> s.isEmpty() ?! Use StringBuilder instead of sum+=x+"."; — Tokazio, Mar 31 '16 at 15:34
Try `x.trim()` to remove whitespaces at start and end, then check `x.isEmpty()` — fafl, Mar 31 '16 at 15:35
@Tokazio how am I supposed to use it? can you give a short example? — user3049183, Mar 31 '16 at 15:38
StringBuilder sum = new StringBuilder(); before loop then sum.append(x).append("."); and end with sum.toString() — Tokazio, Mar 31 '16 at 15:43
Found text = text.replace("\n", "").replace("\r", ""); in another SO post. (http://stackoverflow.com/questions/2163045/how-to-remove-line-breaks-from-a-file-in-java) — Tokazio, Mar 31 '16 at 15:47
You might want to consider using `Keyword.equalsIgnoreCase(s)` instead of `Keyword.toLowerCase().equals(s.toLowerCase())`. And if `Keyword` is not `""`, then the `&& !"".equals(s)` is superfluous. — AJNeufeld, Mar 31 '16 at 16:01

Stephen C · Accepted Answer · 2016-03-31T16:08:55.247

2

You should discard the line separators when you split; e.g.

for (final String s : x.split("\\s+")) {

That makes the word separator a run of one or more whitespace characters, and `\r` and `\n` both count as whitespace for `\s`.

(Using trim() won't work in all cases. Suppose that x contains "word\r\nword". You won't split between the two words, and s will be "word\r\nword" at some point. Then s.trim() won't remove the line break characters because they are not at the ends of the string.)
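To make the difference concrete, here is a minimal sketch. The class and method names (`SplitDemo`, `words`) are made up for illustration and are not part of the asker's code; the leading `trim()` is an added assumption so that leading whitespace does not produce an empty first token.

```java
final class SplitDemo {
    // Splits a sentence on runs of any whitespace: spaces, tabs, \r and \n.
    // trim() first, because split() would otherwise return an empty first
    // element when the sentence starts with whitespace.
    static String[] words(String sentence) {
        return sentence.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String x = "first word\r\nsecond";
        // Splitting on a single space leaves "word\r\nsecond" as one token.
        System.out.println(x.split(" ").length); // prints 2
        // Splitting on \s+ separates all three words.
        System.out.println(words(x).length);     // prints 3
    }
}
```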

UPDATE

I just spotted that you are actually appending x not s. So you also need to do something like this:

sum += x.replaceAll("\\s+", " ") + ".";

That does a bit more than you asked for. It replaces each whitespace sequence with a single space.
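As a small illustration of that replacement on its own, the sketch below wraps it in a hypothetical helper (`WhitespaceDemo.collapse` is an invented name). The extra `trim()` is an assumption added here to drop a leading or trailing space; leave it out if you want to keep the spacing between sentences.

```java
final class WhitespaceDemo {
    // Replaces every run of whitespace (including "\r\n") with one space,
    // then strips any space left at either end.
    static String collapse(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // prints "PDF text with breaks"
        System.out.println(collapse("PDF text\r\nwith  breaks"));
    }
}
```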

By the way, your code would be simpler and more efficient if you used a break to get out of the loop as soon as you find a match, rather than messing around with a flag. (And Java has a boolean type ... for heaven's sake!)

   if (Keyword.toLowerCase().equals(s.toLowerCase()) && !"".equals(s)) {
       sum += ....
       break;
   }
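Putting the pieces from this answer and the comments together, the whole loop might look roughly like the sketch below. The class and method names are hypothetical, and it assumes `text` and `keyword` are passed in rather than held in fields. It also passes `"."` to `StringTokenizer`, since that constructor takes a set of delimiter characters rather than a regex.

```java
import java.util.StringTokenizer;

final class KeywordSentences {
    // Returns every sentence of text that contains keyword as a word,
    // with line breaks and other whitespace runs collapsed to one space.
    static String collect(String text, String keyword) {
        StringBuilder sum = new StringBuilder();
        StringTokenizer tokenizer = new StringTokenizer(text, ".");
        while (tokenizer.hasMoreTokens()) {
            String x = tokenizer.nextToken();
            for (String s : x.trim().split("\\s+")) {
                if (!s.isEmpty() && keyword.equalsIgnoreCase(s)) {
                    sum.append(x.replaceAll("\\s+", " ")).append('.');
                    break; // one match is enough; no flag needed
                }
            }
        }
        return sum.toString();
    }

    public static void main(String[] args) {
        // prints "The cat sat. Cat naps."
        System.out.println(collect("The cat\r\nsat. A dog ran. Cat naps.", "cat"));
    }
}
```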

edited Mar 31 '16 at 16:08

answered Mar 31 '16 at 15:38

Stephen C

unfortunately, it still includes the line breaking feed. – user3049183 Mar 31 '16 at 15:40
@AJNeufeld - That won't help. `\r` and `\n` are members of the `\s` character class. – Stephen C Mar 31 '16 at 15:53
Didn't I use a break? I know Java has a Boolean type, so I treated an integer value of `1` as true and `0` as false, just like what a Boolean does. I know `int` wastes a lot of memory space but this is not a commercial product so I guess that will not harm anything? If this is some serious mistake, you might point it out to me now. – user3049183 Mar 31 '16 at 16:04
Re `break`: yes you did ... but not in the right place. Re: the use of `boolean` instead of `int`: use it for clarity, not efficiency. The fact it is not commercial is beside the point. The point is that other people (like us) need to read your code. – Stephen C Mar 31 '16 at 16:10
Alright. But can you tell me where the right way for using break is? I mean in the context of my code. – user3049183 Mar 31 '16 at 16:24
Oh my, I understood now, I just made some unnecessary coding. That is what happens when you are not a perfect programmer. – user3049183 Apr 01 '16 at 02:48

AJNeufeld · Answer 2 · 2016-03-31T16:25:10.893

0

Are you sure you want to be adding x here?

if(Keyword.toLowerCase().equals(s.toLowerCase()) && !"".equals(s)) {
    sum+=x+"."; //here need first to check for "&#13;&#10"
                // before concatenating the String "x" to String "sum"
    flag=1;
}

Don't you want s?

    sum += s + ".";

UPDATE

Oh, I see. So what you really want is something more like:

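// StringTokenizer takes a set of delimiter characters, not a regex, so "." is enough.
tokenizer = new StringTokenizer(Text, ".");
// Match Keyword as a whole word, ignoring case.
Pattern KEYWORD = Pattern.compile("\\b" + Keyword + "\\b", Pattern.CASE_INSENSITIVE);
StringBuilder sb = new StringBuilder(sum);
while (tokenizer.hasMoreTokens()) {
    String x = tokenizer.nextToken();
    if (KEYWORD.matcher(x).find()) {
        // Collapse each whitespace run, including CR/LF, into a single space.
        sb.append(x.replaceAll("\\s+", " ")).append('.');
    }
}
sum = sb.toString();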

(Assuming Keyword starts and ends with letters, and doesn't itself contain any regex metacharacters)
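If the keyword might contain regex metacharacters, one option is to escape it with `Pattern.quote` before building the pattern. The helper below is only a sketch with invented names (`KeywordPattern`, `forKeyword`). The `\b` anchors still only behave as expected when the keyword starts and ends with a word character.

```java
import java.util.regex.Pattern;

final class KeywordPattern {
    // Builds a case-insensitive whole-word pattern, treating the keyword
    // as literal text so that characters like '.' or '+' are not special.
    static Pattern forKeyword(String keyword) {
        return Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b",
                Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = forKeyword("a.b");
        System.out.println(p.matcher("see A.B here").find()); // prints true
        System.out.println(p.matcher("see axb here").find()); // prints false
    }
}
```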

edited Mar 31 '16 at 16:25

answered Mar 31 '16 at 15:52

AJNeufeld

no, I am sure I want `x` because `s` is a token of it and when the `s` is equal to a keyword, i want to save the whole `x` and not the `s` cause that is useless. – user3049183 Mar 31 '16 at 15:53
Well, `x.split(...)` does not modify `x`, so any new line characters in `x` will still be in it. You'll need to use a solution like @Stephen has already posted. – AJNeufeld Mar 31 '16 at 15:56
