Creating subset from a text file for matching characters using awk

Question

I have a text file that contains a large amount of data; part of it is shown below. I need to create a separate file containing only the European subset. How do I filter those rows out using awk?

File columns are as follows: User ID, Latitude, Longitude, Venue category name, Country code (2-letter)

The text file contains:

3fd66200f964a52008e61ee3    40.726589   -73.995649  Deli / Bodega   US
4eaef4f4722e4efd614ddb80    51.515470   -0.148605   Burger Joint    GB
4eaef8325c5c7964424125c8    50.739561   4.253660    Vineyard    BE
4e210f60d22d0a3f59f4cbfb    5.367963    103.097516  Racetrack   MY
52373a6511d2d4fcba683886    41.434926   2.220326    Medical Center  ES
476f8da1f964a520044d1fe3    40.695163   -73.995448  Thai Restaurant US

New text file should look like this:

4eaef4f4722e4efd614ddb80    51.515470   -0.148605   Burger Joint    GB
4eaef8325c5c7964424125c8    50.739561   4.253660    Vineyard    BE
52373a6511d2d4fcba683886    41.434926   2.220326    Medical Center  ES

Note: I can use either a latitude/longitude bounding box or the country code to extract the subset into a new file.

There are a lot more GB, BE, ES and other European countries in the dataset. This is just a small part that I took from the dataset itself. — Kelvin Chew, Oct 07 '16 at 08:09

James Brown · Accepted Answer · 2016-10-07T09:43:27.423

4

First you need the country codes for the required countries (or all the latitudes and longitudes and corresponding country codes :) in a separate file to check against:

$ cat countries.txt
GB
BE
ES

In awk:

$ awk 'NR==FNR{a[$0];next} $NF in a' countries.txt file.txt
4eaef4f4722e4efd614ddb80    51.515470   -0.148605   Burger Joint    GB
4eaef8325c5c7964424125c8    50.739561   4.253660    Vineyard    BE
52373a6511d2d4fcba683886    41.434926   2.220326    Medical Center  ES
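
The question also allows a latitude/longitude bounding box instead of country codes. The following is only a rough sketch of that idea: the box limits (latitude 35 to 72, longitude -25 to 45) are illustrative assumptions rather than an authoritative definition of Europe, the sample file is assumed to be tab-separated, and the output name europe.txt is made up for the example.

```shell
# Sample rows from the question, recreated tab-separated in a scratch directory.
dir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\t%s\n' \
  3fd66200f964a52008e61ee3 40.726589 -73.995649 'Deli / Bodega' US \
  4eaef4f4722e4efd614ddb80 51.515470 -0.148605 'Burger Joint' GB \
  4eaef8325c5c7964424125c8 50.739561 4.253660 Vineyard BE \
  4e210f60d22d0a3f59f4cbfb 5.367963 103.097516 Racetrack MY \
  52373a6511d2d4fcba683886 41.434926 2.220326 'Medical Center' ES \
  476f8da1f964a520044d1fe3 40.695163 -73.995448 'Thai Restaurant' US \
  > "$dir/file.txt"

# ASSUMPTION: a crude box for Europe (latitude 35..72, longitude -25..45).
# $2 is the latitude and $3 the longitude; -F'\t' keeps venue names that
# contain spaces in a single field.
awk -F'\t' -v minlat=35 -v maxlat=72 -v minlon=-25 -v maxlon=45 \
  '$2 >= minlat && $2 <= maxlat && $3 >= minlon && $3 <= maxlon' \
  "$dir/file.txt" > "$dir/europe.txt"

# Show the subset that was written to the new file.
cat "$dir/europe.txt"
```

On the sample rows this keeps the GB, BE and ES lines. A rectangular box will also catch non-European places that fall inside it, so the country-code list is likely the more precise filter when the codes are available.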

Explained:

NR==FNR {  # this block {} is only processed for the first file (take it for granted)
    a[$0]    # this initializes an array element in a, for example a["GB"]
    next     # since we only initialize an element for each country code in the first file,
             # there is no need to process code beyond this point; just skip to the next record
}          # after this point we check whether the country code exists in array a
$NF in a     # if the element of array a for the value of the last field $NF (for example a["GB"])
             # was initialized, the record from the second file is a required row and is printed.
             # this could've been written: { if($NF in a) print $0 }
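
For readers who would rather see why NR==FNR selects the first file than take it for granted, here is a small demonstration. NR counts records across all input files, while FNR restarts at 1 for each file, so the two are equal only while the first file is being read. The two data rows below are hypothetical and exist only to keep the output short.

```shell
dir=$(mktemp -d)
printf 'GB\nBE\n' > "$dir/countries.txt"
# HYPOTHETICAL rows, only for illustration.
printf 'x 1 2 Cafe US\ny 3 4 Pub GB\n' > "$dir/file.txt"

# Print both counters, and which side of NR==FNR each record falls on.
out=$(awk '{ print NR, FNR, (NR==FNR ? "first file" : "later file"), $0 }' \
  "$dir/countries.txt" "$dir/file.txt")
printf '%s\n' "$out"
# 1 1 first file GB
# 2 2 first file BE
# 3 1 later file x 1 2 Cafe US
# 4 2 later file y 3 4 Pub GB
```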

edited Oct 07 '16 at 09:43

answered Oct 07 '16 at 08:58

James Brown

I still don't understand how to use 'NR==FNR{a[$0];next} $NF in a'. Can you explain more or show me some examples? – Kelvin Chew Oct 07 '16 at 09:26
Updated explanation. – James Brown Oct 07 '16 at 09:35
Thank you so much! – Kelvin Chew Oct 07 '16 at 09:48
don't get the trick for the first block, and I'm not satisfied with the "take it for granted" :D – Aif Oct 07 '16 at 09:49
@Aif I hope you find this post helpful: http://stackoverflow.com/a/32482115/4162356 The explanation just would've been repetition plus it wouldn't have fit in the comments anymore. – James Brown Oct 07 '16 at 09:56

score 0 · Answer 2 · answered Oct 07 '16 at 20:39

Using grep:

grep -wFf countries.txt file.txt

Explanation of options:

-F fixed string search (no regex)
-f specifies a file of patterns
-w matches whole words only
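
One thing that may be worth checking with this approach: -w matches a code as a whole word anywhere in the line, not only in the country column. The sketch below compares it with the awk one-liner from the accepted answer. The second data row is hypothetical, invented only to show a venue name that contains a country code as a word.

```shell
dir=$(mktemp -d)
printf 'GB\nBE\nES\n' > "$dir/countries.txt"

# The first row is taken from the question; the second is a HYPOTHETICAL
# US row whose venue name happens to contain the word "GB".
printf '%s\t%s\t%s\t%s\t%s\n' \
  4eaef4f4722e4efd614ddb80 51.515470 -0.148605 'Burger Joint' GB \
  0000000000000000deadbeef 40.700000 -73.900000 'GB Sandwich Bar' US \
  > "$dir/file.txt"

# grep -w looks for the code as a whole word anywhere in the line: both rows match.
grep_out=$(grep -wFf "$dir/countries.txt" "$dir/file.txt")

# awk compares only the last field: just the GB row matches.
awk_out=$(awk 'NR==FNR{a[$0];next} $NF in a' "$dir/countries.txt" "$dir/file.txt")

printf 'grep kept %s row(s), awk kept %s row(s)\n' \
  "$(printf '%s\n' "$grep_out" | wc -l | tr -d ' ')" \
  "$(printf '%s\n' "$awk_out" | wc -l | tr -d ' ')"
# grep kept 2 row(s), awk kept 1 row(s)
```

If venue names in the real dataset never contain a two-letter code as a separate word, the two commands should select the same rows.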

answered Oct 07 '16 at 20:39

Chris Koknat
