How to write a regular expression that extracts tabbed pieces of text?

Question

I have been trying to write a program that replaces tabs with spaces (assuming a tab is equivalent to 8 spaces, one or more of which may be taken by non-whitespace characters such as letters).

I start by reading the text of a file with a Scanner, as follows:

try {
    reader = new FileReader(file);
} catch (IOException io) {
    println("File not found");
}
Scanner scanner = new Scanner(reader);
scanner.useDelimiter("\\Z");
String text = scanner.next();

I then try to step through the pieces of text that end with a tab using ptrn1 below, and to extract the length of the last word of each piece using ptrn2:

Pattern ptrn1 = Pattern.compile(".*\\t", Pattern.DOTALL);
Matcher matcher1 = ptrn1.matcher(text);
String nextPiece = matcher1.group();
println(matcher1.group()); /* gives me the first substring ending with tab*/

However:

Pattern ptrn2 = Pattern.compile("\\s.*\\t"); /*supposed to capture the last word in the string*/
Matcher matcher2 = ptrn2.matcher(nextPiece);
String lastword = matcher2.group();

The last line gives me an error, apparently because nothing matches the pattern ("\\s.*\\t"). Something is wrong with this last regular expression, which is intended to say "any number of spaces, followed by any number of characters, followed by a tab". I have not been able to find out what is wrong with it, though. I have tried ("\\s*.+\\t"), ("\\s*.*\\t"), and ("\\s+.+\\t"); still no luck.

Later, per the recommendations below, I simplified the code and included the sample string in it, as follows:

import acm.program.*;
import acm.util.*;
import java.util.*;
import java.io.*;
import java.util.regex.*;

public class Untabify extends ConsoleProgram {
    public void run() {
        String s = "Be plain,\tgood son,\tand homely\tin thy drift.\tRiddling\tconfession\tfinds but riddling\tshrift. ";
        Pattern ptrn1 = Pattern.compile(".*?\t", Pattern.DOTALL);
        Pattern ptrn2 = Pattern.compile("[^\\s+]\t", Pattern.DOTALL);

        String nextPiece;

        Matcher matcher1 = ptrn1.matcher(s);

        while (matcher1.find()) {
            nextPiece = matcher1.group();
            println(nextPiece);
            Matcher matcher2 = ptrn2.matcher(nextPiece);
            println(matcher2.group());
        }
    }
}

The program crashes inconsistently: first at "println(matcher2.group())", and on the next run at "public void run()" with the message "Debug Current Instruction Pointer" (what does that mean?).

What text are you trying to match? – Mike B Jan 06 '14 at 22:24
The last word before the tab in a string. – Kambiz Jan 06 '14 at 22:55

score 1 · Answer 1 · edited May 23 '17 at 12:11

You do not need to double-escape the tab character (i.e. \\t); \t will do fine. \t is interpreted as a tab character by the Java string-literal parser, and that tab character is passed to the regex parser, which treats it as a literal tab. You can see this answer for more information.

Also, you should use Pattern.DOTALL, not Pattern.Dotall.
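
To illustrate the point about escaping, here is a minimal sketch showing that both spellings end up matching a tab. The class name TabEscapeDemo and the helper matchesSingleTab are made up for this example and are not part of the original code.

```java
import java.util.regex.Pattern;

class TabEscapeDemo {
    // Hypothetical helper: true if the regex matches a string consisting of exactly one tab.
    static boolean matchesSingleTab(String regex) {
        return Pattern.compile(regex).matcher("\t").matches();
    }

    public static void main(String[] args) {
        // "\t": the Java compiler turns the escape into a real tab character,
        // and the regex engine matches that character literally.
        System.out.println(matchesSingleTab("\t"));   // true
        // "\\t": the regex engine receives a backslash followed by 't'
        // and interprets that as its own tab escape.
        System.out.println(matchesSingleTab("\\t"));  // true
        // "/t": a forward slash followed by 't' is not a tab in either layer.
        System.out.println(matchesSingleTab("/t"));   // false
    }
}
```

So the choice between \t and \\t is a matter of style here; neither spelling explains a failed match.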

answered Jan 06 '14 at 22:30

The Guy with The Hat

Rangi Keen · Answer 2 · 2014-01-07T20:39:50.710

The pattern "\\s.*\\t" must match a single whitespace character (\s) followed by 0 or more characters (.*) followed by a single tab (\t). If you want to capture the last word and a trailing tab, you should use the word boundary escape \b:

Pattern.compile("\\b.*\\b\t");

You could replace the . above to use \w or whatever your definition of a word character is if you don't want to match any character.

Here's the code you'd use to match any word immediately before a tab:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegEx {
    public static void main(String args[]) {
        String text = "ab cd\t ef gh\t ij";
        Pattern pattern = Pattern.compile("\\b(\\w+)\\b\t", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}

The above will output

cd
gh

See the Regular Expression Tutorial, especially the sections on Predefined Character Classes and Boundary Matchers for more information.

You can get more detail and experiment with this regular expression on Regex101.

edited Jan 07 '14 at 20:39

answered Jan 06 '14 at 22:34

Rangi Keen

+1, but you don't need to double-escape the tab character (as stated in my answer). – The Guy with The Hat Jan 06 '14 at 22:45
Thanks. What you said made sense. I changed ptrn2 to Pattern.compile("//b.*//b/t"), and when I test it with the statement println(matcher2.matches()); I still get "false" for every string ending with a tab. I still don't know why I get false even with the previous pattern I used. – Kambiz Jan 06 '14 at 22:58
@RyanCarlson Thanks, I changed the tab character to be single escaped after seeing your answer but I kept the double escape on the first pattern since that is a reference to the original post. – Rangi Keen Jan 07 '14 at 14:02
@Kambiz If you can add some sample strings to your post with expected output I can better help. Note that it may be due to the incorrect slashes in your pattern from your comment above. They should be backslashes (``\``) not forward slashes (`/`). – Rangi Keen Jan 07 '14 at 14:04
@Kambiz `Matcher.matches()` will only return true if the entire input string matches the regular expression. If you want to find substrings, you should use `Matcher.find()`. If it returns true, you can then use `Matcher.group()` to get the substring that was matched by the last match. – Rangi Keen Jan 07 '14 at 15:27
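
The difference between matches(), find(), and group() described in the last comment can be sketched as follows. This is only an illustration: the class MatchesVersusFind, its helper names, and the choice of \S+ as the definition of a "word" are assumptions, not taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVersusFind {
    // Assumed definition of "word before a tab": a run of non-whitespace, then a tab.
    static final Pattern WORD_TAB = Pattern.compile("\\S+\t");

    // matches() succeeds only if the ENTIRE input is one word followed by a tab.
    static boolean matchesWhole(String input) {
        return WORD_TAB.matcher(input).matches();
    }

    // find() looks for the pattern anywhere in the input; group() is valid only after it succeeds.
    static String firstMatch(String input) {
        Matcher m = WORD_TAB.matcher(input);
        return m.find() ? m.group() : null;
    }

    // Calling group() on a fresh Matcher, before any match attempt, throws IllegalStateException.
    static boolean groupThrowsBeforeFind(String input) {
        try {
            WORD_TAB.matcher(input).group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String piece = "good son,\t";
        System.out.println(matchesWhole(piece));          // false: the input also contains "good "
        System.out.println(groupThrowsBeforeFind(piece)); // true
        System.out.println(firstMatch(piece).trim());     // son,
    }
}
```

In the simplified program from the question, matcher2.group() is called without a preceding matcher2.find(), which is the situation the third helper reproduces.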

score 1 · Answer 3 · answered Jan 06 '14 at 22:35

It would be useful to see a sample string. If you just want the last word before the tab, then you can use this:

([^\s]+)\t

Note that the () put the last word in a capturing group. [^\s]+ means one or more non-whitespace characters.
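
As a rough sketch of how this pattern could be applied, the following collects every word that sits directly before a tab. The class name WordsBeforeTabs and the sample text (shortened from the question's string) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordsBeforeTabs {
    // Returns, in order, each run of non-whitespace that is immediately followed by a tab.
    static List<String> wordsBeforeTabs(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile("([^\\s]+)\t").matcher(text);
        while (m.find()) {
            // group(1) is the captured word without the trailing tab.
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        String sample = "Be plain,\tgood son,\tand homely\tin thy drift.";
        for (String word : wordsBeforeTabs(sample)) {
            System.out.println(word);
        }
    }
}
```

Note that punctuation attached to a word (such as the comma in "plain,") counts as part of it under this definition; a narrower class such as \w+ would exclude it.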

acarlon

I used this pattern too, ([^\s]+)\t, and when I then call matcher2.group() or matcher2.group(1), I don't get anything. When I call matcher2.matches(), it returns false. – Kambiz Jan 06 '14 at 23:14
@Kambiz - do you have an example of the string you are trying to match? Even with your regex it should match. – acarlon Jan 06 '14 at 23:19
@acarlon: this is the string example I have been using: Be plain,/t good son,/t and homely in thy/t drift. Riddling/t confession/t finds but riddling/t shrift. – Kambiz Jan 06 '14 at 23:32
@acarlon: and you are right, even with my regex it should work, which is why I find it so perplexing. – Kambiz Jan 06 '14 at 23:39
@Kambiz - all I can think is that the string that you are reading in is wide char or some other type of literal encoding. Have you tried putting that example string in a literal string and testing. E.g. String testString = "Be plain,\t good son,\t and homely in thy\t drift. Riddling\t confession\t finds but riddling\t shrift". Also, I think that it should be \t, not /t which you had in your example. – acarlon Jan 06 '14 at 23:44
@acarlon: Yes, I meant \t in all of the above and not /t, sorry for that. The string is read from a text file. These are actual tabs; I used symbols because tabs cannot be typed here. The first part of the matching (using ptrn1, matcher1) goes through correctly. The debugger shows that, for example, in the first iteration the value of nextPiece is set to "Be plain,\t". This looks like a literal string, which is used as the argument to ptrn2.matcher(nextPiece). However, as I've mentioned, matcher2.group() gives an error, and matcher2.matches() returns false. – Kambiz Jan 07 '14 at 00:14
@Kambiz - yeah, that is strange. All I can suggest at this point is to try to narrow the problem down. If you could put the code up using a literal string (not from file) showing the problem at http://ideone.com/ we would be able to see it happening. Otherwise, it is tricky because it works for us. – acarlon Jan 07 '14 at 00:23
@acarlon - I simplified the code (see the addition at the bottom of my question). It still doesn't go through. I will start using ideone.com; it seems like a useful tool. – Kambiz Jan 07 '14 at 18:38
@Kambiz - I get the same. Unfortunately, I don't have time to look into it right now. However, you can just extract the tabbed string and without tabs in one step. See: http://ideone.com/51q7vM – acarlon Jan 07 '14 at 21:14
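
The linked snippet is not reproduced here. One possible way to get each tab-terminated piece with the tab already removed, in a single pass, is to capture the text before the tab in a group. The sketch below assumes that approach; the class name TabbedPieces and the helper piecesWithoutTabs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TabbedPieces {
    // Returns each piece of text that is terminated by a tab, without the tab itself.
    static List<String> piecesWithoutTabs(String text) {
        List<String> pieces = new ArrayList<>();
        // Lazy .*? stops at the first tab; DOTALL lets a piece span line breaks.
        Matcher m = Pattern.compile("(.*?)\t", Pattern.DOTALL).matcher(text);
        while (m.find()) {
            // group() would include the trailing tab; group(1) leaves it out.
            pieces.add(m.group(1));
        }
        return pieces;
    }

    public static void main(String[] args) {
        String sample = "Be plain,\tgood son,\tshrift.";
        for (String piece : piecesWithoutTabs(sample)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

Text after the final tab (here "shrift.") is not returned, because it is not terminated by a tab; two consecutive tabs yield an empty piece.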
