
I have a text file containing some thousand lines like the following:

File:

abc: bla1 bla1 bla1... 
cde: bla bla bla... 
ghk: bla1 bla1 bla1... 
lmn: bla bla bla...
abc: bla2 bla2 bla2... 
bcd: bla bla bla... 
ghk: bla2 bla2 bla2... 
xyz: bla bla bla...

I want to merge all lines that start with the same item (such as lines 1 and 5, or lines 3 and 7) so that I get a new text file like this:

New File:

abc: bla1 bla1 bla1... * abc: bla2 bla2 bla2... 
cde: bla bla bla... 
ghk: bla1 bla1 bla1... * ghk: bla2 bla2 bla2...
lmn: bla bla bla...
bcd: bla bla bla...   
xyz: bla bla bla...

Can this be solved using regex and/or grep, and if so, how?

I'm quite familiar with grep because I'm on TextWrangler, but also OK with other text editors.

Help much appreciated.

Jotne
Niamh Doyle
  • 1
    I don't think there is an elegant solution for this. Try a Perl approach. First pass: populate a hash whose keys are the start items and whose values are arrays of the line numbers containing that start item. Duplicate the file. Second pass: copy from one file to the next, merging based on the hash. Third pass: delete lines based on the hash. –  Aug 11 '14 at 18:47
  • 1
    Does order matter? If not, sort first. Then you'll have an 'xyz' line followed by another 'xyz' and you can use a regex that will merge those lines into one. – OnlineCop Aug 11 '14 at 21:37

3 Answers


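With GNU bash 4 or later (associative arrays are not available in older versions), if the order does not matter:

declare -A A      # declare associative array A (key -> merged text)
# fill array
while read -r I L; do             # I = first word (the key), L = rest of the line
  if [ -n "${A[$I]}" ]; then
    A[$I]+=" * $L"                # key seen before: append with separator
  else
    A[$I]=" $L"                   # first occurrence of this key
  fi
done < filename
# print array (iteration order of an associative array is unspecified)
for J in "${!A[@]}"; do echo "$J${A[$J]}"; done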

Output:

xyz: bla bla bla...
lmn: bla bla bla...
abc: bla1 bla1 bla1... * bla2 bla2 bla2...
ghk: bla1 bla1 bla1... * bla2 bla2 bla2...
bcd: bla bla bla...
cde: bla bla bla...
Cyrus

If order doesn't matter, I suggest first sorting the text. That will place

abc: ...
abc: ...

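next to one another. Then run this regex for a few passes, until it no longer matches (each pass merges adjacent pairs, so keys with more than two lines need more than one pass):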

Search:
  ^(\w+): (.*)\n\1: 
Replace:
  \1: \2 

Result:
   abc: bla1 bla1 bla1... bla2 bla2 bla2...
   bcd: bla bla bla...
   cde: bla bla bla...
   ghk: bla1 bla1 bla1... bla2 bla2 bla2...
   lmn: bla bla bla...
   xyz: bla bla bla...
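
The sort-first idea can also be done outside the editor. The following is only a sketch, assuming a POSIX shell with sort and awk; the function name merge_sorted is made up for illustration. Unlike the regex above, it keeps the repeated key and joins lines with " * ", as in the question's desired output.

```shell
# merge_sorted (hypothetical helper): sort the input so that lines with the
# same first field become adjacent, then join each run with " * ".
# The original line order is NOT preserved.
merge_sorted() {
  LC_ALL=C sort "$@" | awk '
    NF == 0 { next }                    # skip blank lines
    {
      sub(/[ \t]+$/, "")                # trim trailing whitespace
      if (started && ($1 "") == key) {  # same key as the previous line
        line = line " * " $0
      } else {
        if (started) print line         # flush the finished run
        key = $1 ""                     # force string comparison of keys
        line = $0
        started = 1
      }
    }
    END { if (started) print line }     # flush the last run
  '
}
```

Usage would look like merge_sorted file > newfile, where file and newfile are placeholder names.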

If order DOES matter, run this regex repeatedly until it no longer matches:

Search:
  ^(\w+): (.*)\n((?:(?!\1).*\n)+)\1: (.*\n)
Replace:
  \1: \2 \4\3

Result (1st pass):
  abc: bla1 bla1 bla1... bla2 bla2 bla2...
  cde: bla bla bla...
  ghk: bla1 bla1 bla1...
  lmn: bla bla bla...
  bcd: bla bla bla...
  ghk: bla2 bla2 bla2...
  xyz: bla bla bla...

Result (2nd pass):
  abc: bla1 bla1 bla1... bla2 bla2 bla2...
  cde: bla bla bla...
  ghk: bla1 bla1 bla1... bla2 bla2 bla2...
  lmn: bla bla bla...
  bcd: bla bla bla...
  xyz: bla bla bla...
OnlineCop

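If you can use awk, this should work (the output order is not guaranteed, because for (i in a) iterates in an unspecified order):

awk '{a[$1]=a[$1]?a[$1]"* "$0:$0} END {for (i in a) print a[i]}' file

Output:
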
ghk: bla1 bla1 bla1... * ghk: bla2 bla2 bla2...
lmn: bla bla bla...
cde: bla bla bla...
xyz: bla bla bla...
bcd: bla bla bla...
abc: bla1 bla1 bla1... * abc: bla2 bla2 bla2...
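
If the original order of first appearance should be kept, as in the question's desired output, the awk approach can record the order in which keys are seen. This is a sketch under the assumption that the key is the first whitespace-separated field; the function name merge_by_key is made up for illustration.

```shell
# merge_by_key (hypothetical helper): join lines that share the same first
# field with " * ", printing keys in order of first appearance.
merge_by_key() {
  awk '
    NF == 0 { next }                    # skip blank lines
    {
      sub(/[ \t]+$/, "")                # trim trailing whitespace
      if (!($1 in merged)) {            # first time this key is seen
        order[++n] = $1                 # remember its position
        merged[$1] = $0
      } else {                          # key seen before: append
        merged[$1] = merged[$1] " * " $0
      }
    }
    END {
      for (i = 1; i <= n; i++) print merged[order[i]]
    }
  ' "$@"
}
```

Usage would look like merge_by_key file > newfile, where file and newfile are placeholder names. Keeping a separate order array avoids depending on the iteration order of for (i in a).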

Jotne