1

What is a fast and succinct way to find the lines in a text file that share the same first field?

Sample input:

a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit

Desired output:

b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

Desired output, alternative:

b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit

I can imagine many ways to write this, but I suspect there's a smart way to do it, e.g., with sed, awk, etc. My source file is approx 0.5 GB.

There are some related questions here, e.g., "awk | merge line on the basis of field matching", but that other question loads too much content into memory. I need a streaming method.

some ideas
  • Explain WHY that's the desired output as it's not obvious at all. e.g. are you looking for a tool that will let you specify b, d, and e as desirable key values or are you looking for cases where the key appears twice in the input or something else? – Ed Morton Aug 28 '13 at 16:29
  • I want to merge lines with a matching first field. Sorry this was unclear. Also, the input is sorted. – some ideas Aug 28 '13 at 17:08

5 Answers

3

For fixed-width fields you can use GNU uniq:

$ uniq -Dw 1 file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

If your fields are not fixed width, here are two awk solutions:

awk -F'|' '{a[$1]++;b[$1]=(b[$1])?b[$1]RS$0:$0}END{for(k in a)if(a[k]>1)print b[k]}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

awk -F'|' '{a[$1]++;b[$1]=b[$1]FS$2}END{for(k in a)if(a[k]>1)print k b[k]}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
Chris Seymour
  • Thanks. The second field is an unpredictable length, often >100 chars. btw, those arguments for "uniq" are not available in MacOS, nor Ubuntu. – some ideas Aug 28 '13 at 16:40
  • Fair enough, the two `awk` scripts should do the trick for you. Are you sure they are not available on your Ubuntu machine? What version of coreutils do you have? Here `uniq --version` reports `uniq (GNU coreutils) 8.21`. – Chris Seymour Aug 28 '13 at 16:42
  • Thanks! The second is really what I need. Your methods work well; 0m29.103s processing for the first, and 0m34.036s for the second. – some ideas Aug 28 '13 at 16:51
3

Here's a method that only has to remember the previous line, so it requires the input file to be sorted:

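awk -F \| '
    # Same key as the previous line: the previous line belongs to a group.
    $1 == prev_key {print prev_line; matches++}
    # Key changed: flush the last line of the group that just ended, then reset.
    $1 != prev_key {
        if (matches) print prev_line
        matches = 0
        prev_key = $1
    }
    {prev_line = $0}
    # Use prev_line rather than $0, which some older awks do not keep in END.
    END { if (matches) print prev_line }
' filename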
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

Alternative output:

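awk -F \| '
    # Same key as the previous line: start the merged line if needed,
    # then append the previous value.
    $1 == prev_key {
        if (matches == 0) printf "%s", $1
        printf "%s%s", FS, prev_value
        matches++
    }
    # Key changed: finish the merged line of the group that just ended.
    $1 != prev_key {
        if (matches) printf "%s%s\n", FS, prev_value
        matches = 0
        prev_key = $1
    }
    {prev_value = $2}
    END {if (matches) printf "%s%s\n", FS, prev_value}
' filename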
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
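
Both scripts above rely on lines with the same key being adjacent, which holds when the file is sorted on the first field. As a rough pre-check, here is a minimal sketch that streams through the file once and reports whether the keys ever go backwards. The function name `keys_in_order` is made up for illustration, and the check assumes the file was sorted in plain byte order (for example with `LC_ALL=C sort`); a locale-aware sort may order keys differently.

```shell
# Hypothetical helper (not from the thread): exit status 0 if the keys never decrease.
keys_in_order() {
    awk -F'|' '
        # Compare as strings; stop at the first key that sorts before its predecessor.
        NR > 1 && ($1 "") < (prev "") {bad = 1; exit}
        {prev = $1}
        # bad is unset (treated as 0) when no out-of-order key was seen.
        END {exit bad}
    ' "$@"
}
```

It could be used as `keys_in_order filename && echo "ok to stream"`. Like the scripts above, it keeps only one key in memory.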
glenn jackman
  • But how can OP get `Desired output, alternative`? – anubhava Aug 28 '13 at 16:42
  • Your method works nicely, 0m16.330s processing. time awk -F \| '$1 == prev_key {print prev_line; matches ++} $1 != prev_key { if (matches) print prev_line; matches = 0; prev_key = $1; } {prev_line = $0} END { if (matches) print $0 } ' INFILE > OUTFILE – some ideas Aug 28 '13 at 16:47
1

Using awk:

awk -F '|' '!($1 in a){a[$1]=$2; next} $1 in a{b[$1]=(($1 in b) ? b[$1] : FS a[$1]) FS $2}
    END{for(i in b) print i b[i]}' file
d|amet|consectetur
e|adipisicing|elit
b|ipsum|dolor
anubhava
  • Would have high memory requirement when the input file is large – glenn jackman Aug 28 '13 at 16:38
  • My concern about awk was loading everything into memory, then recalling it in the END; but my concerns might be unwarranted. I'll try this. Thanks! – some ideas Aug 28 '13 at 16:38
  • To my surprise, your method works on my 0.5GB input file. 0m19.184s processing time. time awk -F '|' '!($1 in a){a[$1]=$2; next} $1 in a{b[$1]=b[$1] FS a[$1] FS $2} END{for(i in b) print i b[i]}' INFILE > OUTFILE – some ideas Aug 28 '13 at 16:49
1

This might work for you (GNU sed):

sed -r ':a;$!N;s/^(([^|]*\|).*)\n\2/\1|/;ta;/^([^\n|]*\|){2,}/P;D' file

This reads two lines into the pattern space and checks whether their keys are the same. If they are, it removes the second key, joins the lines, and repeats. If not, it checks whether the first line has more than two fields; if so, it prints that line. Either way, it then deletes the first line from the pattern space.

potong
  • Thanks for this. I already used the awk, but it's useful having a sed solution. – some ideas Aug 28 '13 at 20:00
  • Note, on the mac, "sed -r" is "sed -E"; also note that your method did not work for me, at least on my mac, with the above test content. – some ideas Jul 30 '15 at 16:12
0
$ awk -F'|' '$1 == prev {rec = rec RS $0; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

$ awk -F'|' '$1 == prev {rec = rec FS $2; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
Ed Morton
  • I tried your second method. It's fast, but I got some false hits. Thanks for the sample though. – some ideas Aug 28 '13 at 17:16
  • False hits? Hard to believe if your real input looks like your sample input but if you'd care to share what your input was and the undesirable output you got, I'd be happy to take a look. – Ed Morton Aug 29 '13 at 03:05
  • Ed, I don't mean to criticize, and the error might be on my side. All I can say is that when I ran a quick test, the output was not what I expected. My input is substantially more complex than the sample I gave, but still basically the same idea of 2 fields separated by a pipe. I don't think there's any need to delve into this more. Thanks again. – some ideas Aug 29 '13 at 14:53
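
The cause of the false hits was never identified in this thread. One possibility, offered purely as an assumption because the real input was not shared, is awk's comparison rule: when both operands come from input and look like numbers, `==` compares them numerically, so keys such as `1` and `1.0` count as equal. The sketch below is a variant of the second script above with the comparison forced to string mode; the function name `merge_by_key_strict` is made up for illustration.

```shell
# Sketch only: a variant of the merging script that compares keys as strings.
merge_by_key_strict() {
    awk -F'|' '
        # Appending "" makes both operands strings, so 1 and 1.0 are different keys.
        # NR > 1 keeps an empty key on the first line from matching the unset prev.
        NR > 1 && ($1 "") == (prev "") {rec = rec FS $2; size++; next}
        # New key: print the finished group if it had more than one line.
        {if (size > 1) print rec; rec = $0; size = 1}
        {prev = $1}
        END {if (size > 1) print rec}
    ' "$@"
}
```

It reads from a file argument or from standard input, for example `merge_by_key_strict file`. If the keys never look numeric, it should behave the same as the original script.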