
In my work building an English language database, I often deal with text content from different sources and need to merge lines that share the same first field. I often hack this in a text editor with a regex that captures the first field and searches across "\n", but I often have text files >10GB, so a command-line, streaming solution is preferred to an in-memory one.

Sample input:

apple|pear 
apple|quince 
apple cider|juice
banana|plantain
cherry|cheerful, crimson
cherry|ruddy
cherry|cerise

Desired output:

apple|pear|quince 
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise

The logic is to concatenate (joining with "|") the second fields of all lines that share the same first field.

The only delimiter is "|", and it appears exactly once per input line, i.e. it's effectively a 2-column text file. Overall file order does not matter; the only concern is consecutive lines with an identical first field.

I have lots of solutions and one-liners (often in awk or ruby) for processing single-line content, but I tie myself in knots when dealing with multiple lines, and would appreciate help. For some reason, multiline processing always bogs me down.

I'm sure this can be done succinctly with awk.

Michael Douma

6 Answers


Assumptions/understandings:

  • overall file may not be sorted (by 1st field)
  • all lines with the same string in the 1st field will be listed consecutively; this should eliminate the need to maintain a large volume of data in memory with the tradeoff that we'll need a bit more typing
  • 2nd field may contain trailing white space (per sample input); this will need to be removed
  • output does not need to be sorted (by 1st field)

One awk idea:

awk '

function print_line() {
    if (prev != "")
       print prev,data
}

BEGIN { FS=OFS="|" }

      { if ($1 != prev) {
           print_line()
           prev=$1
           data=""
        }
        gsub(/[[:space:]]+$/,"",$2)              # strip trailing white space
        data= data (data=="" ? "" : OFS) $2      # concatenate 2nd fields with OFS="|"
      }

END   { print_line() }                           # flush last set of data to stdout
' pipe.dat

This generates:

apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise
markp-fuso

Using any awk in any shell on every Unix box and assuming your input is grouped by the first field as shown in your sample input and you don't really have trailing blanks at the end of some lines:

$ cat tst.awk
BEGIN { FS=OFS="|" }
$1 != prev {
    if ( NR>1 ) {
        print out
    }
    out = prev = $1
}
{ out = out OFS $2 }
END { print out }

$ awk -f tst.awk file
apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise

If it's not grouped then do sort file | awk -f tst.awk and if there are trailing blanks then add { sub(/ +$/,"") } as the first line of the script.
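For example, a sketch of the script with both adjustments applied (the sub() rule is placed first so the trailing blanks are gone before $2 is appended):

$ cat tst.awk
BEGIN { FS=OFS="|" }
{ sub(/ +$/,"") }    # trim trailing blanks; modifying $0 re-splits the fields
$1 != prev {
    if ( NR>1 ) {
        print out
    }
    out = prev = $1
}
{ out = out OFS $2 }
END { print out }

$ sort file | awk -f tst.awk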

Ed Morton

Here is a Ruby solution that reads the file line-by-line. At the end I show how much simpler the solution could be if the file could be gulped into a string.

Let's first create an input file to work with.

str =<<~_
  apple|pear 
  apple|quince 
  apple cider|juice
  banana|plantain
  cherry|cheerful, crimson
  cherry|ruddy
  cherry|cerise
_
file_name_in = 'file_in'
File.write(file_name_in, str)
  #=> 112

Solution when file is read line-by-line

We can produce the desired output file with the following method.

def doit(file_name_in, file_name_out)
  fin = File.new(file_name_in, "r")
  fout = File.new(file_name_out, "w")
  str = ''
  until fin.eof?
    s = fin.gets.strip           # read one line, stripping trailing whitespace
    k,v = s.split(/(?=\|)/)      # e.g. "apple|pear" => k = "apple", v = "|pear"
    if str.empty?                # first line of the file
      str = s
      key = k
    elsif k == key               # same key: append "|value" to the line being built
      str << v
    else                         # new key: flush the finished line, start a new one
      fout.puts(str)
      str = s
      key = k
    end
  end
  fout.puts(str)                 # flush the final line
  fin.close
  fout.close
end

Let's try it.

file_name_out = 'file_out'
doit(file_name_in, file_name_out)
puts File.read(file_name_out)

prints the following.

apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise

Note that

"apple|pear".split(/(?=\|)/)
  #=> ["apple", "|pear"]

The regular expression contains the positive lookahead (?=\|) which matches the zero-width location between 'e' and '|'.

Solution when file is gulped into a string

The OP does not want to gulp the file into a string (hence my solution above) but I would like to show how much simpler the problem is if one could do so. Here is one of many ways of doing that.

def gulp_it(file_name_in, file_name_out)
  File.write(file_name_out,
    File.read(file_name_in).gsub(/^(.+)\|.*[^ ]\K *\r?\n\1/, ''))
end
gulp_it(file_name_in, file_name_out)
  #=> 98
puts File.read(file_name_out)

prints

apple|pear|quince 
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy
cherry|cerise

Thinking about what the regex engine will be doing, this may be acceptably fast, depending on file size, of course. Note, however, that gsub resumes scanning after each match rather than re-examining the line it just extended, which is why the three cherry lines above end up only partially combined.

Regex demo

While the link uses the PCRE engine, the result would be the same using Ruby's regex engine (Onigmo). We can make the regular expression self-documenting by writing it in free-spacing mode.

/
^        # match the beginning of a line
(.+)     # match one or more characters
\|.*[^ ] # match '|', then zero or more chars, then a non-space
\K       # resets the starting point of the match and discards
         # any previously-matched characters 
[ ]*     # match zero or more spaces
\r?\n    # match the line terminator(s)
\1       # match the content of capture group 1
/x       # invoke free-spacing mode

(.+) matches 'apple', 'banana' and 'cherry' because those words are at the beginnings of lines. One could alternatively write ([^|]*).
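As a quick sketch of that equivalence (a hypothetical line, using Ruby's String#[] with a capture-group index): because each line contains exactly one '|', the greedy (.+) backtracks to the same spot that ([^|]*) stops at.

line = "apple cider|juice"
line[/^(.+)\|/, 1]     #=> "apple cider"
line[/^([^|]*)\|/, 1]  #=> "apple cider"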

Cary Swoveland

Assuming you have the following sample.txt

apple|pear 
apple|quince 
apple cider|juice
banana|plantain
cherry|cheerful, crimson
cherry|ruddy
cherry|cerise

I am not sure why you want the solution as a "one liner", but the following will do what you want.

cat sample.txt | ruby -e 'puts STDIN.readlines.map {_1.strip}.group_by {_1.split("|").first}.map{|k,v| v.reduce("#{k}") {"#{_1}|#{_2.split("|").last}"}}'

A more readable version with comments describing what's going on:

stripped_lines = STDIN.readlines.map { |l| l.strip } # remove leading and trailing whitespace

# create a hash where the keys are the value to the left of the |
# and the values are the lines beginning with that key, i.e.
# {
#      "apple"=>["apple|pear", "apple|quince"],
#      "apple cider"=>["apple cider|juice"],
#      "banana"=>["banana|plantain"],
#      "cherry"=>["cherry|cheerful, crimson", "cherry|ruddy", "cherry|cerise"]
# }

grouped_by_first_element = stripped_lines.group_by { |sl| sl.split('|').first }

# map to the desired result by starting with the key
# and then concatenating the part to the right of the | for each element,
# i.e. start with apple, then append |pear to get apple|pear, then append |quince
# to that to get apple|pear|quince

result = grouped_by_first_element.map do |key, values|
    values.reduce("#{key}") do |memo, next_element|
        "#{memo}|#{next_element.split('|').last}"
    end 
end

puts result 
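Against the sample input, this should print the desired output:

apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise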
nPn
  • The reason for the one-liners is that I often build up scripts in bash/zsh for processing, and when a small operation can be squeezed into a oneliner, it's much more portable than having a ton of mini script files. Moreover, it makes it easy to combine with various unix commands (grep, uniq, sort, etc.) Usually awk is the king of super short powerful one-liners, but this is nice to have in ruby. – Michael Douma Feb 21 '22 at 15:35

Assume s is a string containing all of the lines in the file. Then

s.split("\n").inject({}) { |h, x| k, v = x.split('|'); h[k] ||= []; h[k] << v.strip; h }

will yield:

{"apple"=>["pear", "quince"], "apple cider"=>["juice"], "banana"=>["plantain"], "cherry"=>["cheerful, crimson", "ruddy", "cerise"]}

Then:

s.split("\n").inject({}) { |h, x| k, v = x.split('|'); h[k] ||= []; h[k] << v.strip; h }.map { |k, v| "#{k}|#{v.join('|')}" }

Yields:

["apple|pear|quince", "apple cider|juice", "banana|plantain", "cherry|cheerful, crimson|ruddy|cerise"]
Chris

A pure bash solution could look like this:

unset out       # make sure we start fresh (important if this is in a loop)
declare -A out  # declare associative array
d='|'           # delimiter

# append all values to the key
while IFS=${d} read -r key val; do
    out[${key}]="${out[${key}]}${d}${val}"
done <file

# print desired output
for key in "${!out[@]}"; do
    printf '%s%s\n' "${key}" "${out[$key]}"
done | sort -t"${d}" -k1


### actual output
apple cider|juice
apple|pear|quince
banana|plantain
cherry|cheerful, crimson|ruddy|cerise

Or you could do this with awk. As mentioned in a comment, pure bash is not a great option, mostly due to performance and portability.

awk -F'|' '{
        sub(/[[:space:]]*$/,"")  # only necessary if you wish to trim trailing whitespace, which existed in your example data
        a[$1]=a[$1] "|" $2       # append value to string
    } END {
        for(i in a) print i a[i] # print all recreated lines
    }' <file


### actual output
apple|pear|quince
banana|plantain
apple cider|juice
cherry|cheerful, crimson|ruddy|cerise
Kaffe Myers
    As always, though, there's no reason to do it in bash and good reasons not to - see [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice). I bring this up because it's not obvious to a newcomer reading this that a pure bash solution will be orders of magnitude slower than, say, an awk solution as well as being lengthier and harder to get the syntax right so as to be secure and robust, and it'll be less portable. – Ed Morton Feb 19 '22 at 14:35