Filter lines that have only alphabets in first column

Question

I am working with the Google English 1-gram dataset, which looks like the following:

C'ape   1804    1       1
C'ape   1821    1       1
C'ape   1826    1       1
C'ape   1838    2       2
C'ape   1844    1       1
C'ape   1869    1       1
C'ape   1874    1       1
C'ape   1878    2       2
C'ape   1879    1       1
C'ape   1880    1       1
CABMEL  1873    1       1
CABMEL  1874    1       1
CABMEL  1875    1       1
CABMEL  1879    1       1
CABMEL  1884    1       1
CABMEL  1890    1       1
CABMEL  1899    1       1
CABMEL  1901    1       1
CABMEL  1903    3       2
CABMEL  1910    2       2
CABMEL  1912    1       1
CABMEL  1915    1       1
CABMEL  1926    2       2
CABMEL  1927    3       2
CABMEL  1928    4       2
CABMEL  1930    2       2

Each row has at least 4 columns, and some rows contain 5. The first column is a 1-gram (a string). I want to extract only those lines whose first column consists solely of letters (upper or lower case). I think grep should do it, but I cannot find the correct regex for the job. Is there any Unix utility that can easily get this done? The columns are tab-delimited, I believe.

EDIT: For the sample above, the output should contain only the lines with CABMEL.

It is a bigger file, I just posted a few records here. I gave the link of the full file in the question. — Wajahat, Nov 07 '15 at 10:10
Note: The file uses tab characters as column delimiter. See my answer below. — Joe, Nov 07 '15 at 10:22

Andreas Louv · Accepted Answer · 2015-11-07T10:36:04.913

Using Perl:

# Match all lines that start with one or more letters (a-z or A-Z) followed by whitespace
perl -ne 'print if m/^[a-z]+\s/i' file

Using awk:

# Match lines whose first field contains only a-z or A-Z
awk '$1 ~ /^[a-zA-Z]+$/' file

Both will output:

CABMEL  1873    1       1
CABMEL  1874    1       1
CABMEL  1875    1       1
CABMEL  1879    1       1
CABMEL  1884    1       1
CABMEL  1890    1       1
CABMEL  1899    1       1
CABMEL  1901    1       1
CABMEL  1903    3       2
CABMEL  1910    2       2
CABMEL  1912    1       1
CABMEL  1915    1       1
CABMEL  1926    2       2
CABMEL  1927    3       2
CABMEL  1928    4       2
CABMEL  1930    2       2

Joe · Answer 2 · 2015-11-07T10:18:58.423

grep -iE '^[a-z]+\s' file

should do. It now uses \s to match the whitespace after the word (the file uses tabs as the delimiter).
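Note that `\s` is a GNU extension in grep rather than part of POSIX extended regular expressions. The following is a minimal sketch of two more portable spellings, assuming a small tab-delimited test file (the name `sample.tsv` and its contents are made up for illustration and only mirror the layout in the question):

```shell
# Hypothetical sample file mirroring the question's tab-delimited layout
printf "C'ape\t1804\t1\t1\nCABMEL\t1873\t1\t1\n" > sample.tsv

# POSIX character class instead of the GNU-specific \s
grep -E '^[a-zA-Z]+[[:space:]]' sample.tsv

# Require a literal tab right after the word; printf produces the tab portably
tab=$(printf '\t')
grep -E "^[a-zA-Z]+${tab}" sample.tsv
```

On this sample, both commands print only the CABMEL row, because the apostrophe in `C'ape` breaks the run of letters before any whitespace is reached.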

edited Nov 07 '15 at 10:18

answered Nov 07 '15 at 09:59


Okay you want to match columns with `'` as well. I edited the answer. – Joe Nov 07 '15 at 10:07
No I do not want the apostrophe, and it still does not work. – Wajahat Nov 07 '15 at 10:08
Although, if you check the outputs from dev-null's answer and your answer and compare them with diff, your answer seems to include some extra lines. – Wajahat Nov 07 '15 at 11:03
Yeah, it seems that dev-null's Perl solution omits words containing diacritics. You can achieve the same with `grep` by using `-iP` instead of `-iE`. dev-null's `awk` solution provides a larger list which contains words with diacritics and also `ß` ligatures. – Joe Nov 07 '15 at 12:34
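If the extra lines do come from bracket ranges such as `[a-z]` matching accented letters under the current locale, one possible workaround is to run the tools in the C locale, where ranges are compared byte by byte. This is only a sketch under that assumption; the helper names `ascii_only_awk` and `ascii_only_grep` and the file `sample_locale.tsv` are hypothetical and not taken from the thread:

```shell
# Hypothetical sample: the third row's word is "Café" (é written as UTF-8 octal bytes)
printf "C'ape\t1804\t1\t1\nCABMEL\t1873\t1\t1\nCaf\303\251\t1900\t1\t1\n" > sample_locale.tsv

# Illustrative helpers: LC_ALL=C makes [a-zA-Z] cover ASCII letters only
ascii_only_awk()  { LC_ALL=C awk -F'\t' '$1 ~ /^[a-zA-Z]+$/' "$1"; }
ascii_only_grep() { LC_ALL=C grep -E '^[a-zA-Z]+[[:space:]]' "$1"; }

ascii_only_awk sample_locale.tsv
ascii_only_grep sample_locale.tsv
```

With this sample, both helpers print only the CABMEL row: the apostrophe and the non-ASCII bytes of `é` fall outside `[a-zA-Z]` in the C locale. Passing `-F'\t'` to awk also makes the tab delimiter explicit instead of relying on default whitespace splitting.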
