How to delete prefix, suffix in a string matching a pattern and split on a character using sed?

Question

I have the following string, which is the output of a cassandra query in bash

col1|col2|col3+++++++++++A|1|a B|2|b C|3|c D|4|d  (3 rows)

I want to split this string so as to remove the string in the beginning till the last + symbol and then remove the tail end, which is (XYZ rows).

So, the string becomes A|1|a B|2|b C|3|c D|4|d. Now, I want to split this string into multiple arrays that look like this

A 1 a 
B 2 b 
C 3 c 
D 4 d

so that I can iterate over each row using a for loop to do some processing. The number of rows can vary.

How can I do this using sed or grep?

I tried this for the first pass but it didn't work:

echo $string | sed 's/([0-9])rows//' | sed 's/[^+]//'

NOTE: the column strings can have multiple spaces in them ex: output of CQL query when written to file is

 topic   | partition | offset
---------+-----------+--------
 topic_2 |        31 |      4
 topic_2 |        30 |      4
 topic_2 |        29 |      4
 topic_2 |        28 |      4
 topic_2 |        27 |      4
 topic_2 |        26 |      4
 topic_2 |        25 |      4
 topic_2 |        24 |      4
 topic_2 |        23 |      4
 topic_2 |        22 |      4
 topic_2 |        21 |      4
 topic_2 |        20 |      4
 topic_2 |        19 |      4
 topic_2 |        18 |      4
 topic_2 |        17 |      4
 topic_2 |        16 |      4
 topic_2 |        15 |      4
 topic_2 |        14 |      4
 topic_2 |        13 |      4
 topic_2 |        12 |      4
 topic_2 |        11 |      4
 topic_2 |        10 |      4
 topic_2 |         9 |      4
 topic_2 |         8 |      4
 topic_2 |         7 |      4
 topic_2 |         6 |      4
 topic_2 |         5 |      4
 topic_2 |         4 |      4
 topic_2 |         3 |      4
 topic_2 |         2 |      4
 topic_2 |         1 |      4
 topic_2 |         0 |      4

(32 rows)

Are you really sure the original query produces its output on a single row? Looks rather like a lack of proper quoting of a variable or command substitution earlier in your script. Maybe see also something like http://stackoverflow.com/questions/26909408/export-cassandra-query-result-to-a-csv-file — tripleee, Jan 28 '16 at 07:58
And how can one tell apart the third column from the previous entry and the first column of the following entry? Can there be multiple sequences of spaces in a single column, like `topic 2`? — Benjamin W., Jan 29 '16 at 00:53

score 5 · Accepted Answer · edited May 23 '17 at 10:27

$ sed 's/[^+]*[+]*\(.*[^ ]\) *(.*)$/\1/;y/ |/\n /' <<< 'col1|col2|col3+++++++++++A|1|a B|2|b C|3|c D|4|d  (3 rows)'
A 1 a
B 2 b
C 3 c
D 4 d

The substitution does the following _{(hat tip to potong for pointing out how to get rid of one more substitution)}:

s/
    [^+]*      # Match non-plusses
    [+]*       # Followed by plusses
    \(         # Capture the next group
        .*     # Any characters (greedily)
        [^ ]   # that end with a non-space
    \)         # End of capture group
     *         # Spaces
    (.*)       # Followed by whatever in parentheses
$/\1/          # Replace all that by the capture group

resulting in this intermediate stage:

$ sed 's/[^+]*[+]*\(.*[^ ]\) *(.*)$/\1/' <<< 'col1|col2|col3+++++++++++A|1|a B|2|b C|3|c D|4|d  (3 rows)'
A|1|a B|2|b C|3|c D|4|d

The transformation (y///) turns all spaces into newlines and pipes into spaces.

Spaces other than the ones separating rows

If there are spaces within column and we assume that each entry has the format

[spaces]entry[spaces]

i.e., exactly two sets of spaces per entry, we have to replace the transformation y/// with another substitution,

s/\([^ |]\)\( \+[^ |]\)/\1\n\2/g

This looks for spaces following not a space or pipe and followed by not a space or pipe, and inserts a newline before those spaces. Result:

$ var='col1 | col2 | col3 +++++++++++ A | 1 | a B | 2 | b C | 3 | c D | 4 | d (3 rows)'
$ sed 's/[^+]*[+]*\(.*[^ ]\) *(.*)$/\1/;s/\([^ |]\)\( \+[^ |]\)/\1\n\2/g' <<< "$var"
 A | 1 | a
 B | 2 | b
 C | 3 | c
 D | 4 | d

As the last thing you want to capture in the back reference grouping is a non-space you can code it so: `s/[^+]*+*$.*[^ ]$ *([^)]*)$/\1/` this will obviate the need for the second substitution. — potong, Jan 28 '16 at 08:52
@potong That extra `s///` totally felt like it should be gone. Updated, thanks! — Benjamin W., Jan 28 '16 at 16:14
Very close..This doesnt seem to work when there is space between the column string for ex: sed 's/[^+]*[+]*$.*[^ ]$ *(.*)$/\1/;y/ |/\n /' <<< 'col1 | col2 | col3 +++++++++++ A | 1 | a B | 2 | b C | 3 | c D | 4 | d (3 rows)' It gives wrong output — user330612, Jan 29 '16 at 00:21
@user330612 No, it doesn't because the question didn't mention spaces and only had one example without spaces. I'll look into it and amend the answer. — Benjamin W., Jan 29 '16 at 00:23

score 2 · Answer 2 · answered Jan 28 '16 at 07:23

echo 'col1|col2|col3+++++++++++A|1|a B|2|b C|3|c D|4|d  (3 rows)' | 
sed -r "s/^.*\+//;s/\(.* rows\)//;s/ /\n/g;s/\|/ /g"
A 1 a
B 2 b
C 3 c
D 4 d

There are 4 substitutions:

turn from start until last plus (greedy) into nothing
turn parens, ending in 'rows' into nothing
replace blanks with newlines
make pipe characters blanks (order of commands matters)

score 0 · Answer 3 · answered Jan 28 '16 at 14:49

0

You can use sed and xargs -n flag to break the numbers up, xargs will echo by default: echo "A|1|a B|2|b C|3|c D|4|d" | sed 's/|/ /g;s/ / /g' | xargs -n 3

A 1 a
B 2 b
C 3 c
D 4 d

answered Jan 28 '16 at 14:49

How to delete prefix, suffix in a string matching a pattern and split on a character using sed?

3 Answers3

Spaces other than the ones separating rows