Word Phrases in Java Regex

Question

I am working on a project for my Java class that involves scanning a text file and then breaking each line up with Java regular expressions. An example of one of the lines in the text file is shown below:

I have been trying to break this up so that I can get the word phrases like "Ultra Liquid Bleach" and "Mountain Fresh" but not the whitespace in between. My code so far is:

([\\w]+|[ ]?)\\b

and I cannot get any further than that. The first two phrases vary in number of words, so any expression that targets a specific number of words will not work. Am I on the right track, or is there a better way of doing what I am trying to do?

How can a computer know that you want `Ultra Liquid Bleach`/`Mountain Fresh` as opposed to `Ultra Liquid`/`Bleach Mountain Fresh` if you don't know how many words will come and have provided no other rules for parsing? — Kon, Sep 18 '17 at 17:39
Why are "Ultra Liquid Bleach" and "Mountain Fresh" two phrases? What counts as a phrase? — Sweeper, Sep 18 '17 at 17:41
Maybe this is a TSV? https://stackoverflow.com/questions/19575308/read-a-file-separated-by-tab-and-put-the-words-in-an-arraylist https://stackoverflow.com/questions/18331696/reading-tab-delimited-textfile-java https://stackoverflow.com/questions/14361650/reading-a-tab-separated-file-in-java What about `(.*?)(?:\t|$)`, assuming it is a TSV? — ctwheels, Sep 18 '17 at 17:51

score 0 · Answer 1 · answered Sep 18 '17 at 18:23

You used an image rather than providing us a text-based example, but this should work for you, assuming "word phrases" are always separated by 3+ spaces and you would never expect tabs or 3+ spaces within an individual "word phrase".

Assumed input:

Disinfecting Wipes        Lemon Fresh                       35 pkg      3.39
Ultra Liquid Bleach       Mountain Fresh                    96 oz       2.39
FF & LS Broth             Chicken                           32 oz       2.99

Regex:

\b(\S+(?:  ?\S+)*)\b

Explanation:

\b: Word boundary (zero-width position between a word char (\w) and a non-word char (\W), or at the start/end of the input next to a word char)
(: Capturing group starts here
- \S+: One or more non-whitespace characters
- (?:: Non-capturing group starts here
  - "  ?" (two spaces followed by ?): A literal space and then 0 or 1 more literal spaces
  - \S+: One or more non-whitespace characters
- )*: This non-capturing group may be present zero or more times
): End of capturing group
\b: Word boundary
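
For anyone who wants to try this from Java, here is a minimal sketch of how the regex above could be applied to one line of the assumed input. The class name `PhraseSplitter` and the method `phrases` are made up for illustration, and the sketch relies on the same assumption as the answer (phrases are separated by 3+ spaces). Note that every backslash has to be doubled inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not from the original answer.
public class PhraseSplitter {
    // The regex from the answer, with backslashes escaped for a Java string literal.
    private static final Pattern PHRASE =
            Pattern.compile("\\b(\\S+(?:  ?\\S+)*)\\b");

    // Returns every phrase found in the line, in order of appearance.
    public static List<String> phrases(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = PHRASE.matcher(line);
        while (m.find()) {
            out.add(m.group(1)); // group 1 is the capturing group around the whole phrase
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "Ultra Liquid Bleach       Mountain Fresh                    96 oz       2.39";
        System.out.println(phrases(line));
        // prints [Ultra Liquid Bleach, Mountain Fresh, 96 oz, 2.39]
    }
}
```

Under the same 3+ spaces assumption, `line.trim().split(" {3,}")` should give the same fields for these sample lines. That is an alternative added here for comparison, not part of the answer above.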
