
I am somewhat new to the Linux environment. I looked all over for an answer to this -- apologies if this has been asked before.

I wrote an awk script that operates on a big text file (11 gigs, 40 columns, 48M rows). The script is called "cycle.awk." It replaces a column with a new version of it. It requires the data to be sorted first by that particular column. In order to run the script on all the columns, I wrote a bash command like this:

cat input.csv |
    sort -k 22 -t "," | awk -v val=22 -f cycle.awk |
    sort -k 23 -t "," | awk -v val=23 -f cycle.awk |
    sort -k 24 -t "," | awk -v val=24 -f cycle.awk |
    sort -k 25 -t "," | awk -v val=25 -f cycle.awk |
    sort -k 26 -t "," | awk -v val=26 -f cycle.awk |
    sort -k 27 -t "," | awk -v val=27 -f cycle.awk |
    sort -k 28 -t "," | awk -v val=28 -f cycle.awk |
    sort -k 29 -t "," | awk -v val=29 -f cycle.awk |
    sort -k 30 -t "," | awk -v val=30 -f cycle.awk |
    sort -k 31 -t "," | awk -v val=31 -f cycle.awk |
    sort -k 32 -t "," | awk -v val=32 -f cycle.awk |
    sort -k 33 -t "," | awk -v val=33 -f cycle.awk |
    sort -k 34 -t "," | awk -v val=34 -f cycle.awk |
    sort -k 35 -t "," | awk -v val=35 -f cycle.awk |
    sort -k 36 -t "," | awk -v val=36 -f cycle.awk |
    sort -k 37 -t "," | awk -v val=37 -f cycle.awk |
    sort -k 38 -t "," | awk -v val=38 -f cycle.awk |
    sort -k 39 -t "," | awk -v val=39 -f cycle.awk |
    sort -k 40 -t "," | awk -v val=40 -f cycle.awk |
    sort -k 41 -t "," | awk -v val=41 -f cycle.awk > output.csv

I figure there must be a more elegant way to do this. How can I write a bash script that will allow me to pass the columns I want to apply my awk script to, and then run this kind of piping procedure without needing to produce any temporary data files? I am avoiding temporary files because the input file is so large and I am interested in optimal performance.

BTW, the script is as follows. It basically shortens the values of some columns for purposes of compressing the text file. Any pointers on how to tighten it up? This procedure takes about 10 hours to run.

BEGIN{ FS=","; OFS=","; count=1 }
NR == 1 { temp=$val }
{
    if ( temp != $val ) {
        temp=$val;
        count++;
    }
    $val=count
    print $0
}

Input typically looks something like this:

id,c1
1,abcd
2,efgh
3,abcd
4,abcd
5,efgh

where the corresponding output would be:

id,c1
1,1
2,2
3,1
4,1
5,2

Technically, it would be sorted by c1 but that's not the point.

Nick
  • Can you provide a few lines of sample input, and the output you'd like to see? – glenn jackman Aug 08 '14 at 21:47
  • Pipe into a recursive function. See http://stackoverflow.com/questions/9898939/handling-long-edit-lists-in-xmlstarlet for an example. – Charles Duffy Aug 08 '14 at 21:53
  • As far as I can tell, your awk script contains one print statement, which prints the original input line and executes unconditionally on every line of input, and your script is therefore equivalent to `cat`. Thus your entire command line is equivalent to `sort -k 41 -t "," < input.csv > output.csv`. What did I misunderstand? – rob mayoff Aug 08 '14 at 22:00
  • By the way -- when it's run on content too large for memory, (the GNU implementation of) `sort` itself can create temporary files, so you're not necessarily avoiding them here. – Charles Duffy Aug 08 '14 at 22:01
  • Ahh. Given the algorithm, sorting N different ways really does need to happen, then. – Charles Duffy Aug 08 '14 at 22:15
  • @Nick How many different values occur in each column of your input? How much RAM does your computer have? – rob mayoff Aug 08 '14 at 22:26
  • @robmayoff I'm using an old computer with just 2 gigs of RAM. Trying to learn the skills before just buying tons of RAM. One of my columns typically will have about 1000 distinct values and those distinct values are about 12 characters in length. The purpose of the procedure is to compress those values because they are unnecessarily long. – Nick Aug 08 '14 at 22:56
  • @CharlesDuffy Thanks for the valuable comment. I did not realize that `sort` works entirely in memory. Are you aware of a workaround for this? – Nick Aug 08 '14 at 22:57
  • @Nick, read my comment again -- GNU `sort` writes to temporary files when it can't fit content in memory. – Charles Duffy Aug 08 '14 at 23:23
  • ...there isn't really room for a workaround by the nature of the task -- I can think of an algorithm by which one could limit the _size_ of the working space needed (whether that space is in RAM or on disk), but at a substantial cost in algorithmic complexity (forcing additional full passes over the data). – Charles Duffy Aug 09 '14 at 00:09
  • It sounds like you can do this in 2 passes, by passing the filename twice to the awk script. The first pass will look like `NR==FNR { A22[$22];A23[$23];` and so on just to build up lists of unique column values. As soon as the pass is finished, sort the arrays' keys and assign the elements values of 1, 2, etc, for example A22["abcd"] will be 1, A22["abce"] will be 2. Second pass, do `{$22=A22[$22];$23=A23[$23];...;print $0}` – Mark Plotnick Aug 09 '14 at 00:37
  • @MarkPlotnick, if I read that correctly, there's a significant risk of memory exhaustion, no? – Charles Duffy Aug 09 '14 at 03:01
  • @CharlesDuffy Depends on the number of unique strings. OP mentioned one column with 1000 distinct values, 12 characters each. If that's the biggest set, and there are 20 columns like that, it seems like not a lot of memory. – Mark Plotnick Aug 09 '14 at 03:12
  • 1000 values/column * 12 bytes/value * 20 columns = 240000 bytes. Even if awk's hash table has 3x overhead, that's still under one megabyte. On a host with 2 GB of RAM, that should not be a problem. – rob mayoff Aug 09 '14 at 06:07
  • @robmayoff Could you spell out your algorithm for me? Are you justifying Mark's algorithm? I don't understand that one either. – Nick Aug 09 '14 at 13:03
  • A truncated comment showed up in my feed asking where, in the OP's code, an input field was being replaced. I can't find the comment here, but: `$val=count` will replace the field whose index is `val`. – Mark Plotnick Aug 09 '14 at 16:11
  • Agreed -- _if_ the number of unique values is small, a two-pass approach is very reasonable; I hadn't gathered that to be the case across the board. – Charles Duffy Aug 09 '14 at 16:25
  • @Nick, basically, Mark's observation is that you can build up a map of original values to new values for each column (in a first pass), and then apply it in a second pass -- much faster than needing two passes per column, *but* it has the caveat that memory requirements scale with the total number of unique values in all the columns being processed. – Charles Duffy Aug 09 '14 at 16:27

3 Answers

The real Right Answer is to rewrite your process to not need this kind of pipeline. However, if you do want to set up such a pipeline, use a recursive function (that pipes to itself):

process_column() {
  sort -k "$1" -t, | awk -v val="$1" -f cycle.awk
}

process_column_range() {
  local min_col=$1
  local max_col=$2
  if (( min_col < max_col )); then
    process_column "$min_col" \
     | process_column_range "$(( min_col + 1 ))" "$max_col"
  else
    process_column "$min_col"
  fi
}

...and then, to invoke (notice that no cat is needed):

process_column_range 22 41 <input.csv >output.csv
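
If the columns you need aren't a contiguous range, the same self-piping trick works with an explicit list of column numbers instead of a min/max pair. A rough, untested sketch along the same lines (process_columns is an invented name; it reuses process_column from above):

process_columns() {
  if (( $# > 1 )); then
    local col=$1; shift
    process_column "$col" | process_columns "$@"
  else
    process_column "$1"
  fi
}

# e.g. renumber only columns 22, 25 and 31:
process_columns 22 25 31 <input.csv >output.csv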
Charles Duffy

Here's a two-pass approach using the algorithm suggested by @robmayoff. It could be implemented in awk or native bash (the latter at a significant performance penalty), but I'm going with Python for readability:

#!/usr/bin/env python

import sys, collections, itertools

input_file_name = sys.argv[1]
col_start = int(sys.argv[2])   # column numbers are 1-based, as in awk/sort
col_end = int(sys.argv[3]) + 1

vals = collections.defaultdict(set)

# first pass: build up translation tables for each column
for line in open(input_file_name, 'r'):
  cols = line.rstrip('\n').split(',') # if this is real CSV, use the Python csv module instead
  for col_no in range(col_start, col_end):
    val = cols[col_no - 1]
    if not val in vals[col_no]: # O(1) operation on sets, vs O(n) on lists
      vals[col_no].add(val)

# interim processing: make sets into dicts w/ values numbered from 1 in ascending order
for col_no in vals:
  vals[col_no] = dict(zip(sorted(vals[col_no]),
                          (str(n) for n in itertools.count(1))))

# second pass: apply translation tables and print output
for line in open(input_file_name, 'r'):
  cols = line.rstrip('\n').split(',')
  for col_no in range(col_start, col_end):
    val = cols[col_no - 1]
    cols[col_no - 1] = vals[col_no][val]
  print(','.join(cols))

I don't propose this as an accepted answer, since it doesn't actually answer the question posed (constructing chains of pipes), but if the number of unique values in the columns being renumbered is low, it may be useful to you.

Invoke as:

./process_column_range input.csv 22 41 >output.csv
Charles Duffy

Here is my suggestion for a two-pass solution based on comments by @robmayoff. It uses gawk (which offers built-in sorting functions). As long as the space required to store all the distinct column values doesn't get into the multi-gigabyte range, it ought to run well, and will be much faster than doing 20 sorts and awk passes.

This sample sorts on columns 2, 3, and 4.

s.awk:

# makemaps() replaces the values in the str-indexed arrays with integers,
# sorted by ascending index value
function makemaps() {
    PROCINFO["sorted_in"]="@ind_str_asc";
    n=1;
    for(i in A2) A2[i]=n++;
    n=1;
    for(i in A3) A3[i]=n++;
    n=1;
    for(i in A4) A4[i]=n++;
    mapsdone=1;
}
BEGIN { FS=","; OFS=","; mapsdone=0; }
{
    if (NR == FNR) {
        # first pass
        # allocate array elements by index. Don't need to assign values yet.
        A2[$2];A3[$3];A4[$4];
    } else {
        # second pass
        # if not yet done, set up arrays' values to be small sequential integers
        if (!mapsdone) makemaps();
        # replace fields with the corresponding small integers
        $2=A2[$2];
        $3=A3[$3];
        $4=A4[$4];
        print $0;
    }
}

input file:

1,abcd,red,mercury
2,efgh,orange,mercury
3,abcd,green,venus
4,abcd,blue,earth
5,efgh,red,earth

output of gawk -f s.awk input input (you need to list the input file twice):

1,1,4,2
2,2,3,2
3,1,2,3
4,1,1,1
5,2,4,1

As a larger test, I used this script to generate a 48-million line input file with three 12-character columns:

BEGIN {
    OFS=",";
    for(i=0; i<48000000; i++) {
        print i,
        "aaaaaaaaa" int(1000*rand()),
        "bbbbbbbbb" int(1000*rand()),
        "ccccccccc" int(1000*rand());
    }
}
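
Saved as, say, gen.awk (the name is just for illustration), it can be run without any input file, since everything happens in the BEGIN block:

awk -f gen.awk > input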

Running /usr/bin/time -v awk -f s.awk input input > output resulted in

Command being timed: "awk -f s.awk input input"
User time (seconds): 139.73
System time (seconds): 6.12
Percent of CPU this job got: 94%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:34.85
Maximum resident set size (kbytes): 1896

This is on a single-core VMWare CPU on a 3.4GHz system.

So with 20 columns, it may take 17 minutes or so and use up no more than 15 megabytes of RAM.

Mark Plotnick
  • Nice -- obviously, your awk-fu is stronger than mine (hence, my retreat to Python). Question -- do you have a sane way to avoid hardcoding the column range (or size of same)? (I was thinking that in a language without nested dict/map types, one could use the column number as a prefix -- ideally padded out so each column sorts together -- but Python avoided the need to attempt such an implementation). – Charles Duffy Aug 10 '14 at 22:39
  • @CharlesDuffy As far as I know, no. awk doesn't have eval or references, so the closest I could get would be to do as you said, use the column number as part of an array index or value. – Mark Plotnick Aug 11 '14 at 14:01
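
One possible way to avoid hardcoding the column numbers (a rough, untested sketch; renumber.sh and the variable names are invented for illustration) is to have a short bash wrapper generate an s.awk-style program for whichever columns you pass it, then hand that program to gawk:

#!/usr/bin/env bash
# renumber.sh -- build an s.awk-style gawk program for the column numbers
# given on the command line, then run it over the input file twice.
# Hypothetical usage:  ./renumber.sh input.csv 2 3 4 > output.csv

input=$1; shift

prog='function makemaps() { PROCINFO["sorted_in"] = "@ind_str_asc";'
for c in "$@"; do
  prog+=" n = 1; for (i in A$c) A$c[i] = n++;"
done
prog+=' mapsdone = 1 }
BEGIN { FS = ","; OFS = "," }
NR == FNR {'
for c in "$@"; do
  prog+=" A$c[\$$c];"
done
prog+=' next }
{
  if (!mapsdone) makemaps();'
for c in "$@"; do
  prog+=" \$$c = A$c[\$$c];"
done
prog+=' print $0
}'

gawk "$prog" "$input" "$input"

The generated program has the same shape as s.awk above, so the memory and speed characteristics should be unchanged; only the column list moves out of the awk source.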