How to split a string or file that may be delimited by a combination of comments and spaces, tabs, newlines, commas, or other characters

Question

If the file list.txt contains really ugly data like so:

aaaa 
#bbbb
cccc, dddd; eeee
 ffff;
    #gggg hhhh
iiii

jjjj,kkkk ;llll;mmmm
nnnn

How do we parse/split that file with a bash script, excluding the commented lines and delimiting it by all commas, semicolons, and whitespace (including tabs, spaces, newlines, and carriage-return characters)?

score 1 · Answer 1 · edited May 23 '17 at 10:25

It can be done with the following code:

#!/bin/bash
### read file:
file="list.txt"

array=()
### "|| [[ -n ... ]]" also handles a last line that has no trailing newline
while IFS= read -r line || [[ -n "$line" ]]; do
    ### skip lines that begin with a "#" or "<whitespace>#"
    match_pattern='^[[:space:]]*#'
    if [[ "$line" =~ $match_pattern ]]; then
        continue
    fi

    ### replace semicolons, commas and carriage returns with a space everywhere...
    temp_line=${line//[;,]/ }
    temp_line=${temp_line//$'\r'/ }

    ### split the line at whitespace; the default IFS is never modified,
    ### so there is nothing to save and restore
    read -r -a split_line_arr <<< "$temp_line"

    ### push each word in split_line_arr onto the final array
    array+=("${split_line_arr[@]}")
done < "$file"

echo "Array items:"
for item in "${array[@]}"; do
    printf "   %s\n" "$item"
done

This was not posed as a question, but rather as a better solution to what others have touched upon when answering related questions. What is unique here is that those other questions and solutions did not really address how to split a string that is delimited by a combination of spaces, other characters, and comments; this is one solution that addresses all three simultaneously.
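Since the title also mentions strings, the same idea can be wrapped in a function that takes a string instead of a file name. The following is only a minimal sketch: the function name split_tokens and the result array tokens are hypothetical names chosen for illustration, and the sample input is a shortened copy of the data from the question.

```bash
#!/bin/bash
# split_tokens (hypothetical helper): split a possibly multi-line string on
# commas, semicolons and whitespace, skipping comment lines.
# The result is stored in the global array "tokens".
split_tokens() {
    local IFS=$' \t\n'          # make sure the default word splitting applies
    local line words
    local comment_re='^[[:space:]]*#'
    tokens=()
    while IFS= read -r line || [[ -n "$line" ]]; do
        # skip lines whose first non-blank character is "#"
        [[ "$line" =~ $comment_re ]] && continue
        # turn commas, semicolons and carriage returns into spaces
        line=${line//[;,]/ }
        line=${line//$'\r'/ }
        # split the cleaned line into words on runs of whitespace
        read -r -a words <<< "$line"
        tokens+=("${words[@]}")
    done <<< "$1"
}

split_tokens $'aaaa \n#bbbb\ncccc, dddd; eeee\n ffff;\n    #gggg hhhh\niiii'
printf '%s\n' "${tokens[@]}"
```

Using a global array keeps the sketch compatible with older bash versions; with the sample input it leaves six items in tokens (aaaa, cccc, dddd, eeee, ffff, iiii).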

Related questions:

How to split one string into multiple strings separated by at least one space in bash shell?

How do I split a string on a delimiter in Bash?

Additional notes:

Why do this with bash when other scripting languages are better suited for splitting? A bash script is more likely to have everything it needs when running from a basic upstart or cron (sh) environment, compared with a Perl program, for example. An argument list is often needed in these situations, and we should expect the worst from people who maintain those lists...

Hopefully this post will save bash newbies a lot of time in the future (including me)... Good luck!

BMW · Accepted Answer · 2014-01-09T17:55:04.413

1

Using shell commands:

grep -v '^[[:space:]]*#' file | tr ';,' '\n\n' | awk '$1=$1'

edited Jan 09 '14 at 17:55

answered Jan 09 '14 at 04:09

BMW


Can you make your answer more descriptive? For example, explain why it works and what each command and argument represents. This will make your answer more valuable for future readers and help teach the OP. – Aaron Brager Jan 09 '14 at 04:30
I am not sure which parts you have questions about. grep, tr, and awk are all popular shell commands, and I am not here to teach them. People can search Google for plenty of examples if they want to learn these commands. Why do we need to treat them like babies? – BMW Jan 09 '14 at 04:39
This is a nice trick, except I would write it like this to get it into a usable array: array=( $(grep -v '^[[:space:]]*#' list.txt | tr ';,' '\n\n' | awk '$1=$1') ) Now I'll attempt to explain how it works: The grep command returns all lines that do not match a comment pattern of optional whitespace leading up to a # character. The tr command then swaps all instances of ";" or "," with a newline. The awk command is a nifty trick that removes leading and trailing whitespace from lines and drops lines that are empty. Hence you are left with the desired data... – Damian Green Jan 09 '14 at 18:06
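The pipeline discussed in the comments above can be sketched as follows, with the result loaded into a real bash array via mapfile (bash 4 or later). Several things here are assumptions rather than part of the original answer: the temporary file only stands in for list.txt, the comment pattern is written with [[:space:]], and the awk step is swapped for tr -s plus grep . so that words separated only by spaces or tabs also become separate items.

```bash
#!/bin/bash
# Demo input (assumption: stands in for list.txt from the question)
input=$(mktemp)
printf '%s\n' 'aaaa ' '#bbbb' 'cccc, dddd; eeee' ' ffff;' \
    '    #gggg hhhh' 'iiii' '' 'jjjj,kkkk ;llll;mmmm' 'nnnn' > "$input"

# 1. grep -v drops lines whose first non-blank character is "#"
# 2. tr turns every ";", "," and whitespace character into a newline,
#    and -s squeezes each run of newlines into a single one
# 3. grep . drops the empty line that can remain at the very start
mapfile -t array < <(
    grep -v '^[[:space:]]*#' "$input" | tr -s ';,[:space:]' '\n' | grep .
)

rm -f "$input"
printf '%s\n' "${array[@]}"
```

Reading the output with mapfile -t stores one item per line, so the items are not subject to further word splitting or glob expansion the way an unquoted command substitution would be.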

score 1 · Answer 3 · answered Jan 09 '14 at 06:45

sed 's/[# \t,]/REPLACEMENT/g' input.txt

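The above command replaces comment characters ('#'), spaces (' '), tabs ('\t'), and commas (',') with an arbitrary string ('REPLACEMENT'). Note that '\t' inside a bracket expression is understood as a tab by GNU sed; other sed implementations may treat it literally.
To replace newlines as well, keep in mind that tr only maps single characters and cannot insert a multi-character string. One option is to let awk print every line with a different output record separator:

sed 's/[# \t,]/REPLACEMENT/g' input.txt | awk -v ORS='REPLACEMENT' '1'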

answered Jan 09 '14 at 06:45

csiu

If you want to ignore comments, try `$ YOUR-CMD-HERE | grep -v '^#'`. This will ignore lines beginning with `'#'`. – csiu Jan 09 '14 at 19:44

score 0 · Answer 4 · answered Jan 09 '14 at 03:56

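If you have Ruby on your system:

# File.foreach closes the file automatically when iteration finishes
File.foreach("file") do |line|
  next if line[/^\s*#/]                                # skip comment lines
  puts line.split(/\s+|[;,]/).reject { |c| c.empty? }  # split, drop empty tokens
end

Output: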

# ruby test.rb 
aaaa
cccc
dddd
eeee
ffff
iiii
jjjj
kkkk
llll
mmmm
nnnn

answered Jan 09 '14 at 03:56

kurumi
