
I am currently doing a project in my Java class which involves scanning a text file and then breaking each line up with Java regular expressions. An example of one of the lines in the text file is shown below:

Picture of text

I have been trying to break this up so that I get the word phrases like "Ultra Liquid Bleach" and "Mountain Fresh" but not the white space in between. My code so far is:

([\\w]+|[ ]?)\\b

and I cannot get any further than that. The first two phrases vary in number of words, so any expression that targets a specific number of words will not work. Am I on the right track, or is there a better way of doing what I am trying to do?

PM 77-1
Justin Do
  • How can a computer know that you want `Ultra Liquid Bleach`/`Mountain Fresh` as opposed to `Ultra Liquid`/`Bleach Mountain Fresh` if you don't know how many words will come and have provided no other rules for parsing? – Kon Sep 18 '17 at 17:39
  • Why are "Ultra Liquid Bleach" and "Mountain Fresh" two phrases? What counts as a phrase? – Sweeper Sep 18 '17 at 17:41
  • What separates your columns? – PM 77-1 Sep 18 '17 at 17:44
  • Maybe this is a TSV? https://stackoverflow.com/questions/19575308/read-a-file-separated-by-tab-and-put-the-words-in-an-arraylist https://stackoverflow.com/questions/18331696/reading-tab-delimited-textfile-java https://stackoverflow.com/questions/14361650/reading-a-tab-separated-file-in-java What about `(.*?)(?:\t|$)`, assuming it is a TSV? – ctwheels Sep 18 '17 at 17:51

1 Answer


You used an image rather than providing a text-based example, but this should work for you, assuming "word phrases" are always separated by 3+ spaces and you would never expect tabs or 3+ spaces within an individual "word phrase".

Assumed input:

Disinfecting Wipes        Lemon Fresh                       35 pkg      3.39
Ultra Liquid Bleach       Mountain Fresh                    96 oz       2.39
FF & LS Broth             Chicken                           32 oz       2.99

Regex:

\b(\S+(?:  ?\S+)*)\b

Explanation (see also: more detail and output of run against assumed input):

  • \b: Word boundary (zero-width marker between a word char (\w) and a non-word char (\W))
  • (: Capturing group starts here
    • \S+: One or more non-whitespace characters
    • (?:: Non-capturing group starts here
      • "  ?": A literal space and then 0 or 1 more literal spaces
      • \S+: One or more non-whitespace characters
    • )*: This non-capturing group may be present zero or more times
  • ): End of capturing group
  • \b: Word boundary
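Since the question is about Java, here is a minimal sketch of how the pattern above could be applied with java.util.regex. The class name PhraseSplitter and the method splitPhrases are made up for illustration, and the sample lines are the assumed input from this answer, not the asker's real file. Note that every backslash in the regex has to be doubled inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for this sketch.
class PhraseSplitter {
    // Same regex as above, with backslashes escaped for a Java string literal.
    private static final Pattern PHRASE =
            Pattern.compile("\\b(\\S+(?:  ?\\S+)*)\\b");

    /** Returns every phrase (capture group 1) found in one line, left to right. */
    static List<String> splitPhrases(String line) {
        List<String> phrases = new ArrayList<>();
        Matcher m = PHRASE.matcher(line);
        while (m.find()) {          // advance to the next match in the line
            phrases.add(m.group(1));
        }
        return phrases;
    }

    public static void main(String[] args) {
        // Assumed input: columns separated by three or more spaces.
        String[] lines = {
            "Disinfecting Wipes        Lemon Fresh                       35 pkg      3.39",
            "Ultra Liquid Bleach       Mountain Fresh                    96 oz       2.39",
            "FF & LS Broth             Chicken                           32 oz       2.99"
        };
        for (String line : lines) {
            System.out.println(splitPhrases(line));
        }
    }
}
```

For the second line this should print [Ultra Liquid Bleach, Mountain Fresh, 96 oz, 2.39]. If the real file turns out to be tab-separated, as suggested in the comments, this pattern will still split on the tabs, but a plain split on the tab character would probably be simpler.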
Adam Katz