
I want to find and list the lines in a text file that contain only two words of four or more characters.

I can find words of four characters or more with:

grep '[A-Za-z][A-Za-z][A-Za-z][A-Za-z][A-Za-z]*' file.txt

but how can I limit output to show only lines with two such words?

Any hints (not necessarily an answer)?

thanks

UPDATE: Thank you. After following your advice I now have:

egrep '([A-Za-z]){4,}' file.txt

That lists all the lines, highlighting the words that are 4+ letters long. Now I only need to filter the output down to the lines where such words (4+ letters long) occur twice. Any hints?

M.Chelm
  • Adding a few sample lines (say 3-5) and the expected output would add clarity and help with testing. It would also help to know how the words are separated: by spaces or something else? – Sundeep Sep 28 '18 at 15:34
  • `egrep` is deprecated; use `grep -E` instead. The parens around the bracket expression `([...])` are redundant, and `[A-Za-z]` doesn't necessarily match all letters, which I think is probably what you want. See [this recent question](https://stackoverflow.com/q/52570103/1745001) for an example. Use the character class `[:alpha:]` instead of `A-Za-z`: instead of `egrep '([A-Za-z]){4,}'`, use `grep -E '[[:alpha:]]{4,}'`. – Ed Morton Sep 30 '18 at 15:06
  • Just to be clear: finding lines with 2 **or more** such words is trivial; finding lines with **exactly** 2 such words using only standard grep is the hard part. So when you post your sample input and expected output, include input lines with more than 2 words of 4+ letters, like `foo stuff and things with bar`, to make sure those are **not** output. Also make sure your input contains punctuation and other chars/strings that you could imagine a tool might struggle to categorize correctly. – Ed Morton Sep 30 '18 at 15:19

3 Answers


To look for two (or more) instances of PATTERN on a line, use:

PATTERN.*PATTERN

If you use grep -E, you can use curly braces to avoid the repetition:

grep -E '(.*PATTERN){2,}'

(You could also apply the same trick to avoid repeating [A-Za-z] in your pattern.)

You can use \< and \> (supported by GNU grep, though not part of POSIX) to match the beginning and end of words, so that an 8-letter word isn't detected as two 4-letter words.
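Putting those pieces together, here is a minimal sketch. It assumes GNU grep (for `\<`, `\>`) and uses invented sample lines held in a hypothetical `sample` variable. Like the patterns above, it selects lines with two or more such words, not exactly two:

```shell
# Hypothetical sample input, one case per line.
sample='foo stuff and things
abcdefgh
a big cat
stuff things more'

# At least two whole words of 4+ letters. The word boundaries stop
# a single 8-letter word from being counted as two 4-letter words.
at_least_two=$(printf '%s\n' "$sample" |
  grep -E '(.*\<[[:alpha:]]{4,}\>){2,}')

printf '%s\n' "$at_least_two"
```

With this input, only the first and last lines are printed; `abcdefgh` is rejected because it is one word.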

John Kugelman
  • You'd need to include some sort of word separator too; otherwise a line with a single long word would match as well. – Sundeep Sep 28 '18 at 15:36
  • @John Kugelman, when I apply the first instance to my pattern it highlights words of four characters or more and everything in between. When I apply the second instance it doubles my pattern and lists words of eight characters or more. – M.Chelm Sep 28 '18 at 15:37
  • `PATTERN.*PATTERN` looks for **2 or more** instances of `PATTERN`. The tricky part of this question is to match a line that is `PATTERN foo PATTERN` but not a line that is `PATTERN foo PATTERN bar PATTERN`. – Ed Morton Sep 30 '18 at 16:09
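If chaining two grep calls is acceptable, one possible way to get "exactly two" is to keep lines with at least two matches and then discard lines with at least three. This is only a sketch: it assumes GNU grep for `\<` and `\>`, and the `sample` and `word` variables and the sample lines are made up for illustration.

```shell
# Hypothetical sample input.
sample='foo stuff and things
foo stuff and things with bar
stuff, things!
just one big cat
abcdefgh'

# One whole word of 4+ letters (GNU word boundaries).
word='\<[[:alpha:]]{4,}\>'

# Keep lines with at least two such words, then drop lines with three or more.
exactly_two=$(printf '%s\n' "$sample" |
  grep -E "$word.*$word" |
  grep -vE "$word.*$word.*$word")

printf '%s\n' "$exactly_two"
```

Here `foo stuff and things` and `stuff, things!` survive, while the line containing `stuff`, `things` and `with` is filtered out by the second grep.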

Just use awk so you don't have to come up with some convoluted regexp to do everything at once. With GNU awk for word boundaries and assuming your "words" only contain alphabetic characters as in your posted script:

awk 'gsub(/\<[[:alpha:]]{4,}\>/,"&") == 2'

The above is untested, of course, since you didn't provide sample input/output for us to test against.
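For awks that lack GNU awk's `\<` and `\>`, one possible variant (a swapped-in technique, not the command above) is to split each line on runs of non-letters and count the pieces that are four or more letters long. The sample lines and the `sample`, `piece` and `count` names are invented for illustration:

```shell
# Hypothetical sample input.
sample='foo stuff and things
foo stuff and things with bar
stuff, things!
abcdefgh'

exactly_two=$(printf '%s\n' "$sample" | awk '{
  n = 0
  # Split on runs of non-letters; each piece is a candidate word.
  count = split($0, piece, /[^[:alpha:]]+/)
  for (i = 1; i <= count; i++)
    if (length(piece[i]) >= 4) n++
  # Print only lines with exactly two words of 4+ letters.
  if (n == 2) print
}')

printf '%s\n' "$exactly_two"
```

Like the gsub() version, this treats punctuation as a separator, so `stuff, things!` counts as two words.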

EDIT: Here's the solution given on page 216 of the text you referenced in your comments for exercise 7.5 (page 100), on which you based your question:

egrep '(\<[A-Za-z]{4,}\>).*\<\1\>' file

Let's first clean that up to remove the deprecated egrep and replace the character lists with a portable character class:

grep -E '(\<[[:alpha:]]{4,}\>).*\<\1\>' file

Now what you have is a script that, rather than looking for lines that contain only two words of four characters or more as stated in your question, looks for lines in which the same word of 4 or more characters occurs at least twice, which is a very different and much simpler problem to solve.
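A small illustration of that difference, assuming GNU grep (back-references in EREs and `\<`/`\>` are extensions) and made-up sample lines in a hypothetical `sample` variable:

```shell
# Hypothetical sample input.
sample='this is this again
stuff and things
that and that and that'

# Lines where the same 4+ letter word occurs at least twice.
repeated=$(printf '%s\n' "$sample" |
  grep -E '(\<[[:alpha:]]{4,}\>).*\<\1\>')

printf '%s\n' "$repeated"
```

The line `stuff and things` has two different long words and is not printed, while the line with `that` three times is, because the pattern only asks for a repeat of the same word.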

Ed Morton
  • It's an exercise from a book, and awk hadn't been mentioned at this point, so I would like to stick with grep. – M.Chelm Sep 30 '18 at 13:18
  • I really don't know how you're going to be able to do this job with just a call to standard grep. If the exercise is from a book and doesn't allow awk, then I assume the solution also doesn't involve a shell script with multiple calls to grep, or GNU grep and its GNU-specific (and highly experimental according to [the man page](https://www.gnu.org/software/grep/manual/grep.html)) -P extension for PCREs. So I'll be very interested to see what the eventual solution is. Please edit your question to include concise, testable sample input and expected output to get help. – Ed Morton Sep 30 '18 at 14:56
  • It's exercise 7.5 (page 100) from this document: [link](https://www.linuxcertification.co.za/sites/default/files/linux-esentials-manual.pdf). Maybe I'm interpreting it wrongly. – M.Chelm Oct 01 '18 at 18:57
  • That's an extremely ambiguous question. It could mean the same word occurs twice, or any words occur twice, or at least twice, or something else. I also see in the examples that the book tells you to use `egrep` and `fgrep` instead of `grep -E` and `grep -F`, and to do things like `grep \ frog.txt` with no quotes around the argument, when you should always quote arguments to commands. They also use the term "word bracket" when they actually mean "word boundary", so I'd treat anything in that book with a large grain of salt. – Ed Morton Oct 01 '18 at 20:38
  • The statement `There are two varieties of grep : Traditionally, the stripped-down fgrep (“fixed”) Varieties does not allow regular expressions—it is restricted to character strings—but is very fast. egrep (“extended”) offers additional regular expression operators, but is a bit slower and needs more memory.` is wrong and oddly phrased. `grep` searches for BREs, `grep -F` searches for strings, `grep -E` searches for EREs, and `grep -P` (GNU grep) searches for PCREs. Speed of execution is really irrelevant since you should simply use the one you NEED to use for the match you want. – Ed Morton Oct 01 '18 at 20:46
  • I looked up the solution they gave in that book and updated my answer to discuss it. – Ed Morton Oct 01 '18 at 21:33
  • Thank you very much – M.Chelm Oct 02 '18 at 16:17
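To make the BRE / ERE / fixed-string distinction from the comments above concrete, here is a small sketch with invented input in a hypothetical `sample` variable. The same interval is written differently in a BRE and an ERE, and `grep -F` treats the braces as literal text:

```shell
# Hypothetical sample input: a doubled letter, and the literal text a{2}.
sample='aa
a{2}'

# BRE (plain grep): interval braces must be backslash-escaped.
bre=$(printf '%s\n' "$sample" | grep 'a\{2\}')

# ERE (grep -E): interval braces are written bare.
ere=$(printf '%s\n' "$sample" | grep -E 'a{2}')

# Fixed string (grep -F): no regex at all, so a{2} is matched literally.
fixed=$(printf '%s\n' "$sample" | grep -F 'a{2}')

printf 'BRE: %s\nERE: %s\nfixed: %s\n' "$bre" "$ere" "$fixed"
```

The BRE and ERE calls both select `aa`; the `-F` call selects only the line that literally contains `a{2}`.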

1st: I recommend using \w (letter) for letters; it's cleaner.
2nd: To group your pattern into a single token, use (); to find multiple copies of a regex token, use {}. (See the cheat sheet.)
3rd: In this case your delimiter is whitespace, so I'd use \s, since I assume you might want to catch things like tabs. But that's at your own discretion.

Side note: I recommend avoiding * unless you have a strong delimiter (e.g. .* will greedy match to the end of your string).

Cheat sheet: https://www.rexegg.com/regex-quickstart.html

  • There are no standard UNIX tools that would understand `\w` or `\s` so YMMV there (I expect most GNU versions of the tools would though), and in the tools that do support `\w` it does not mean letter; it means word-constituent character, which includes letters, numbers, and underscore. I don't know how to write a regexp that uses `{}` to specify repetitions of non-contiguous groups of 4-char strings, but maybe others could figure it out (maybe with PCRE look-ahead/behind?). The OP also didn't say anything about her delimiters being white space; they could be punctuation for all we know so far. – Ed Morton Sep 29 '18 at 21:06
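A quick way to see the `\w` point for yourself, assuming a grep that supports `\w` (GNU grep does; it is not POSIX) and a made-up two-line input in a hypothetical `sample` variable:

```shell
# Hypothetical sample input: one line with an underscore and a digit,
# one line of plain letters.
sample='ab_1
abcd'

# \w is a word-constituent character: letters, digits and underscore.
with_w=$(printf '%s\n' "$sample" | grep -E '^\w{4}$')

# [[:alpha:]] matches letters only.
with_alpha=$(printf '%s\n' "$sample" | grep -E '^[[:alpha:]]{4}$')

printf 'with \\w:\n%s\nwith [[:alpha:]]:\n%s\n' "$with_w" "$with_alpha"
```

The `\w` version accepts both lines, including `ab_1`, while the `[[:alpha:]]` version accepts only `abcd`.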