I have a text file that contains a large amount of data; a sample is shown below. I need to create a separate file containing only the European rows. How do I filter them out using awk?

The file columns are as follows: User ID, Latitude, Longitude, Venue category name, Country code (2-letter)

Text file containing:

3fd66200f964a52008e61ee3    40.726589   -73.995649  Deli / Bodega   US
4eaef4f4722e4efd614ddb80    51.515470   -0.148605   Burger Joint    GB
4eaef8325c5c7964424125c8    50.739561   4.253660    Vineyard    BE
4e210f60d22d0a3f59f4cbfb    5.367963    103.097516  Racetrack   MY
52373a6511d2d4fcba683886    41.434926   2.220326    Medical Center  ES
476f8da1f964a520044d1fe3    40.695163   -73.995448  Thai Restaurant US

New text file should look like this:

4eaef4f4722e4efd614ddb80    51.515470   -0.148605   Burger Joint    GB
4eaef8325c5c7964424125c8    50.739561   4.253660    Vineyard    BE
52373a6511d2d4fcba683886    41.434926   2.220326    Medical Center  ES

Note: I can use either a latitude/longitude bounding box or the country code to extract the subset into a new file.

James Brown
There are a lot more GB, BE, ES and other European countries in the dataset. This is just a small part that I took from the dataset itself. – Kelvin Chew Oct 07 '16 at 08:09

2 Answers

First you need the country codes for the required countries (or all the latitudes and longitudes and corresponding country codes :) in a separate file to check against:

$ cat countries.txt
GB
BE
ES

In awk:

$ awk 'NR==FNR{a[$0];next} $NF in a' countries.txt file.txt
4eaef4f4722e4efd614ddb80    51.515470   -0.148605   Burger Joint    GB
4eaef8325c5c7964424125c8    50.739561   4.253660    Vineyard    BE
52373a6511d2d4fcba683886    41.434926   2.220326    Medical Center  ES

Explained:

NR==FNR {    # true only while reading the first file: NR counts records across all
             # files, FNR restarts at 1 for each file, so they are equal only in file one
    a[$0]    # referencing a[$0] creates an array element keyed by the line, e.g. a["GB"]
    next     # skip the remaining rules and move on to the next record (country code)
}            # everything below therefore runs only for the second file
$NF in a     # true if the last field ($NF), e.g. "GB", is a key in array a;
             # a pattern with no action prints the matching row.
             # this could've been written: { if ($NF in a) print $0 }
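
The question also mentions a bounding box as an alternative to country codes. A minimal sketch of that variant follows, using the sample rows from the question. The box limits (latitude 35 to 72, longitude -25 to 45) and the output name europe.txt are assumptions chosen for illustration, not values from the original post.

```shell
# Sample rows from the question (whitespace-separated).
cat > file.txt <<'EOF'
3fd66200f964a52008e61ee3    40.726589   -73.995649  Deli / Bodega   US
4eaef4f4722e4efd614ddb80    51.515470   -0.148605   Burger Joint    GB
4eaef8325c5c7964424125c8    50.739561   4.253660    Vineyard    BE
4e210f60d22d0a3f59f4cbfb    5.367963    103.097516  Racetrack   MY
52373a6511d2d4fcba683886    41.434926   2.220326    Medical Center  ES
476f8da1f964a520044d1fe3    40.695163   -73.995448  Thai Restaurant US
EOF

# Assumed, rough bounding box for Europe; adjust the limits to your own definition.
awk -v lat_min=35 -v lat_max=72 -v lon_min=-25 -v lon_max=45 '
    # $2 is the latitude and $3 the longitude; adding 0 forces a numeric comparison.
    # A pattern with no action prints every row for which it is true.
    ($2 + 0) >= lat_min && ($2 + 0) <= lat_max &&
    ($3 + 0) >= lon_min && ($3 + 0) <= lon_max
' file.txt > europe.txt

cat europe.txt
```

Because the user ID contains no spaces, $2 and $3 are always the coordinates even though the venue category may span several fields. A rectangular box this size also covers parts of North Africa and the Middle East, so the country-code list is the more precise filter when the codes are available.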
James Brown

Using grep:

grep -wFf countries.txt file.txt

Explanation of options:

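  • -F treats each pattern as a fixed string, not a regex
  • -f reads the patterns from a file, one per line
  • -w matches whole words only; the word may appear anywhere on the line, not just in the country column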
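
Since -w matches a code anywhere on the line, a venue category that happens to contain a country code as a word would also be selected. The sketch below shows one way to restrict the match to the last column. The "BE Cafe" row and the file names loose.txt and strict.txt are hypothetical, added only to demonstrate the difference, and the anchored pattern assumes lines have no trailing whitespace or carriage returns.

```shell
# Two rows from the question plus one hypothetical row whose category contains "BE".
cat > file.txt <<'EOF'
4eaef4f4722e4efd614ddb80    51.515470   -0.148605   Burger Joint    GB
0000000000000000000000aa    40.700000   -74.000000  BE Cafe US
476f8da1f964a520044d1fe3    40.695163   -73.995448  Thai Restaurant US
EOF
printf '%s\n' GB BE ES > countries.txt

# Whole-word match anywhere on the line: also keeps the hypothetical US row.
grep -wFf countries.txt file.txt > loose.txt

# Join the codes into an alternation (GB|BE|ES) and anchor it to the end of the line.
codes=$(paste -sd'|' countries.txt)
grep -E "[[:space:]](${codes})\$" file.txt > strict.txt

echo "loose:";  cat loose.txt
echo "strict:"; cat strict.txt
```

The awk answer avoids this issue by design, because it compares only the last field ($NF) against the list.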
Chris Koknat