
I am working with the Google English 1-gram dataset (link here), which looks like the following:

C'ape   1804    1       1
C'ape   1821    1       1
C'ape   1826    1       1
C'ape   1838    2       2
C'ape   1844    1       1
C'ape   1869    1       1
C'ape   1874    1       1
C'ape   1878    2       2
C'ape   1879    1       1
C'ape   1880    1       1
CABMEL  1873    1       1
CABMEL  1874    1       1
CABMEL  1875    1       1
CABMEL  1879    1       1
CABMEL  1884    1       1
CABMEL  1890    1       1
CABMEL  1899    1       1
CABMEL  1901    1       1
CABMEL  1903    3       2
CABMEL  1910    2       2
CABMEL  1912    1       1
CABMEL  1915    1       1
CABMEL  1926    2       2
CABMEL  1927    3       2
CABMEL  1928    4       2
CABMEL  1930    2       2

Each row has at least 4 columns, and some rows contain 5. The first column is a 1-gram (a string). I want to extract only those lines whose first column consists solely of letters (upper- or lower-case alphabetic characters). I think grep should do it, but I cannot find the correct regex for the job. Is there a Unix utility that can do this easily? The columns are tab-delimited, I believe.

EDIT: For the sample above, the output should contain only the CABMEL lines.

Wajahat
  • It is a bigger file; I just posted a few records here. The link to the full file is in the question. – Wajahat Nov 07 '15 at 10:10
  • Note: The file uses tab characters as column delimiter. See my answer below. – Joe Nov 07 '15 at 10:22

2 Answers


Using Perl:

# Print lines that start with one or more letters (a-z, case-insensitive) followed by whitespace
perl -ne 'print if m/^[a-z]+\s/i' file

Using awk:

# Print lines whose first field contains only a-z or A-Z
awk '$1 ~ /^[a-zA-Z]+$/' file

Both will output:

CABMEL  1873    1       1
CABMEL  1874    1       1
CABMEL  1875    1       1
CABMEL  1879    1       1
CABMEL  1884    1       1
CABMEL  1890    1       1
CABMEL  1899    1       1
CABMEL  1901    1       1
CABMEL  1903    3       2
CABMEL  1910    2       2
CABMEL  1912    1       1
CABMEL  1915    1       1
CABMEL  1926    2       2
CABMEL  1927    3       2
CABMEL  1928    4       2
CABMEL  1930    2       2
Andreas Louv

grep -iE '^[a-z]+\s' file

should do. It now uses \s to match the whitespace after the first column (the file uses tabs as the delimiter).
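
Note that `\s` in `grep -E` is a GNU extension rather than part of POSIX ERE. If portability matters, one option is to match a literal tab instead. The following is only a sketch: the temporary file and its three rows are made-up sample data in the same layout as the question, and the variable names are hypothetical.

```shell
# Hypothetical sample data: tab-separated rows in the 1-gram layout.
tab_sample=$(mktemp)
printf 'CABMEL\t1873\t1\t1\n'   >  "$tab_sample"
printf "C'ape\t1804\t1\t1\n"    >> "$tab_sample"
printf 'abc\t1900\t2\t2\t1\n'   >> "$tab_sample"   # a row with 5 columns

# Hold a literal tab character in a variable so the regex does not rely on \s.
tab=$(printf '\t')

# Keep rows whose first column is one or more ASCII letters followed by a tab.
tab_matches=$(grep -E "^[A-Za-z]+${tab}" "$tab_sample")
printf '%s\n' "$tab_matches"

rm -f "$tab_sample"
```

On this sample, the `C'ape` row is dropped and the other two rows are kept.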

Joe
  • Okay you want to match columns with `'` as well. I edited the answer. – Joe Nov 07 '15 at 10:07
  • No I do not want the apostrophe, and it still does not work. – Wajahat Nov 07 '15 at 10:08
  • However, if you diff the outputs from dev-null's answer and your answer, yours seems to include some extra lines. – Wajahat Nov 07 '15 at 11:03
  • Yeah, it seems that dev-null's Perl solution omits words containing diacritics. You can achieve the same with `grep` by using `-iP` instead of `-iE`. dev-null's `awk` solution provides a larger list which contains words with diacritics and also `ß` ligatures. – Joe Nov 07 '15 at 12:34
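
Whether a bracket range such as `[a-z]` also matches accented letters can depend on the active locale, which would explain the differing outputs discussed in the comments. If only the 52 ASCII letters are wanted, one way to make `awk` behave that way is to force the C locale and split on tabs explicitly. This is a minimal sketch, not a definitive fix: the sample rows (including the UTF-8 encoded "Café") and the variable names are assumptions made for illustration.

```shell
# Hypothetical sample data: tab-separated rows in the 1-gram layout.
ngram_sample=$(mktemp)
printf 'CABMEL\t1873\t1\t1\n'      >  "$ngram_sample"
printf "C'ape\t1804\t1\t1\n"       >> "$ngram_sample"
printf 'Caf\303\251\t1900\t1\t1\n' >> "$ngram_sample"   # "Café" as UTF-8 bytes

# LC_ALL=C makes [A-Za-z] cover only the ASCII letters.
# -F'\t' splits fields on tabs, so $1 is exactly the first column.
ascii_only=$(LC_ALL=C awk -F'\t' '$1 ~ /^[A-Za-z]+$/' "$ngram_sample")
printf '%s\n' "$ascii_only"

rm -f "$ngram_sample"
```

With the C locale forced, only the `CABMEL` row survives on this sample; both the apostrophe row and the accented row are filtered out.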