Finding columns that contain only white space in a text file and replacing them with a unique separator

Question

I have a file like this:

aaa  b b ccc      345
ddd  fgt f u      3456
e r  der der      5 674

As you can see, the only way to separate the fields is to find the character columns that contain nothing but spaces on every line. How can we identify these columns and replace them with a unique separator such as `,`? The desired output is:

aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674

Note:
If we find every contiguous run of character columns that contain only white space (nothing else) and replace each whole run with a single `,`, the problem will be solved.

Better explanation of the question, by josifoski: for each vertical block of characters, if every character in the block is a space, then the whole block should be replaced with a single `,` on every line.
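The rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not code from the thread: the function names `separator_columns` and `replace_blank_columns` are made up for this example, and it assumes that a line shorter than the others is treated as if it were padded with spaces.

```python
def separator_columns(lines):
    """Return the 0-based character positions that are blank on every line."""
    width = max((len(line) for line in lines), default=0)
    # Assumption: a missing character at the end of a short line counts as a space.
    padded = [line.ljust(width) for line in lines]
    return {i for i in range(width) if all(row[i] == " " for row in padded)}


def replace_blank_columns(lines, sep=","):
    """Replace each run of all-blank columns with a single separator."""
    blank = separator_columns(lines)
    result = []
    for line in lines:
        out = []
        for i, ch in enumerate(line):
            if i in blank:
                # Emit the separator only at the start of a run of blank columns.
                if i - 1 not in blank:
                    out.append(sep)
            else:
                out.append(ch)
        result.append("".join(out))
    return result


sample = [
    "aaa  b b ccc      345",
    "ddd  fgt f u      3456",
    "e r  der der      5 674",
]
for row in replace_blank_columns(sample):
    print(row)
```

A single space such as the one inside `b b` survives because other lines have a non-space character at that position, which is what distinguishes it from the space between `der` and `der`.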

Is it one space between `f` and `u`, while it is more than one space between `fgt` and `f`? — akrun, Jun 16 '15 at 13:15
No, the column width can range from 1 to 20. I put two white spaces for the second column intentionally. — user1436187, Jun 16 '15 at 13:18
If the column just have a white space. Nothing else. If we find all columns with one or more white spaces and replace them with `,` (all the column) the problem will be solved. — user1436187, Jun 16 '15 at 13:26
Oh right, so we just need to find the columns which are delimited by whitespace, with whitespace in them, to determine which columns contain whitespace. — 123, Jun 16 '15 at 13:29
@user1436187 THINK a bit harder about what you're telling us. There is one white space between `der` and the second `der`. You're telling us that just one blank char means the text on either side of it is all part of one column but your output shows 2 columns, `der,der`. You also seem to be saying `one space is within a column` and also `one or more spaces separate columns` which are completely inconsistent statements. — Ed Morton, Jun 16 '15 at 13:31
True mind game, i like this question. Well per block of matrix characters, if all are 'space' then all block should be replaced vertically with one , on every line — josifoski, Jun 16 '15 at 13:34
@josifoski there is no text manipulation you can do in python that you can't do in awk. In fact we see many questions from people writing python scripts asking how to call awk to manipulate text from their python script but never the reverse. — Ed Morton, Jun 16 '15 at 13:46
@EdMorton Its mostly used on mainframes, but it's pretty good for stuff like ops problem. — 123, Jun 16 '15 at 15:00

score 4 · Accepted Answer · answered Jun 16 '15 at 13:38

$ cat tst.awk
BEGIN{ FS=OFS=""; ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ }
NR==FNR {
    for (i=1;i<=NF;i++) {
        if ($i == " ") {
            space[i]
        }
        else {
            nonSpace[i]
        }
    }
    next
}
FNR==1 {
    for (i in nonSpace) {
        delete space[i]
    }
}
{
    for (i in space) {
        $i = ","
    }
    gsub(/,+/,",")
    print
}

$ awk -f tst.awk file
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674

Ed Morton


speedy! needs explanation – josifoski Jun 16 '15 at 13:40
Feel free to provide said explanation :-). I think it's pretty simple and clear with maybe a couple of references to the man page for newbies so I'd rather the OP think about it and ask questions if any. – Ed Morton Jun 16 '15 at 13:42
@EdMorton I think it deserves at least a little explanation. This part `ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ ` would definitely be confusing for a newbie even if they do read the man page. I think you forget how much more difficult it is to follow program logic when you are not familiar with the language. – 123 Jun 16 '15 at 14:24
I'd much rather answer questions than have to explain each line that MIGHT be confusing. Between google and the man pages I think a LITTLE effort would answer any questions but I'm happy to answer any for anyone who can't figure it out after putting that little bit of effort in. – Ed Morton Jun 16 '15 at 15:48
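For readers who asked for an explanation of the accepted answer: `ARGV[ARGC]=ARGV[ARGC-1]; ARGC++` appends a copy of the last command-line argument, so awk reads the same file twice, and `NR==FNR` is true only during the first read. The Python sketch below mirrors that two-pass structure as closely as seemed reasonable. It is an illustration rather than a translation endorsed by the answer's author; the names `blank_columns_from_file`, `rewrite`, and `demo` are invented for this example.

```python
import os
import re
import tempfile


def blank_columns_from_file(path):
    """Pass 1: record positions holding a space and positions holding anything else."""
    space, non_space = set(), set()
    with open(path) as fh:
        for line in fh:
            for i, ch in enumerate(line.rstrip("\n")):
                (space if ch == " " else non_space).add(i)
    # Like the awk "delete space[i]" loop: drop positions that are non-space anywhere.
    return space - non_space


def rewrite(path, sep=","):
    """Pass 2: overwrite separator positions, then squeeze repeated separators."""
    blank = blank_columns_from_file(path)
    squeeze = re.compile("(?:" + re.escape(sep) + ")+")
    out = []
    with open(path) as fh:
        for line in fh:
            chars = list(line.rstrip("\n"))
            for i in blank:
                # Unlike awk, this sketch does not extend lines that are too short.
                if i < len(chars):
                    chars[i] = sep
            out.append(squeeze.sub(sep, "".join(chars)))
    return out


def demo():
    """Write the sample from the question to a temporary file and convert it."""
    text = (
        "aaa  b b ccc      345\n"
        "ddd  fgt f u      3456\n"
        "e r  der der      5 674\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file")
        with open(path, "w") as fh:
            fh.write(text)
        return rewrite(path)


print("\n".join(demo()))
```

One caveat that applies to the squeeze step here, and to `gsub(/,+/,",")` in the awk script as well: a run of commas that was already present in the data would also be collapsed to one, so a separator that cannot occur in the input is the safer choice.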

123 · Answer 2 · 2015-06-17T11:58:04.773

Another solution in awk:

awk 'BEGIN{OFS=FS=""}  # Empty field separator, so each character is a field

FNR==NR{for(i=1;i<=NF;i++)a[i]+=$i!=" ";next}  # First pass: a[i] counts the lines that
                                  # have a non-space character at position i.
                                  # next skips the remaining rules during the first pass.
     {                            # Second pass (same file, read a second time)
        for(i=1;i<=NF;i++)        # Loop through the character positions
           if(!a[i]){             # If no line has a non-space at this position
              $i=","              # Change the character to ","
              x=i                 # Start scanning from this position
              while(++x<=NF&&!a[x]){  # While the next position is also all-space (x<=NF keeps the scan inside the line)
                 $x=""            # Remove the character
                 i=x              # Advance i so these positions are not visited again
              }
           }
      }1' test{,} # 1 prints each line; test{,} expands to "test test" in bash, so the file is read twice

Yeah I wonder how often someone takes a commented script and thinks they'll "fix" it by adding punctuation :-). — Ed Morton, Jun 17 '15 at 12:00

score 0 · Answer 3 · answered Jun 18 '15 at 14:06

Since you have also tagged this r, here is a possible solution using the R package readr. It looks like you want to read a fixed-width file and convert it to a comma-separated file. You can use read_fwf to read the fixed-width file and write_csv to write the comma-separated file.

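# required package
library(readr)
# path_to_input and path_to_output are placeholders for your own file paths
# read data; fwf_empty() guesses the column positions from the empty columns
df <- read_fwf(path_to_input, fwf_empty(path_to_input))
# write data without a header row
write_csv(df, path_to_output, col_names = FALSE)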
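The fixed-width view of the problem can also be sketched without readr. The Python below infers field boundaries from the columns that are blank on every line, slices each line at those boundaries, and lets the standard `csv` module write the result. It is loosely analogous to the approach above, not a reimplementation of `fwf_empty`; the names `field_spans` and `to_csv` are invented for this example, and short lines are assumed to be padded with spaces.

```python
import csv
import io


def field_spans(lines):
    """Return (start, end) pairs for each maximal run of columns that is not blank on every line."""
    width = max((len(line) for line in lines), default=0)
    padded = [line.ljust(width) for line in lines]
    spans, start = [], None
    for i in range(width + 1):
        # Treat the position just past the widest line as blank to close the last span.
        blank = i == width or all(row[i] == " " for row in padded)
        if not blank and start is None:
            start = i
        elif blank and start is not None:
            spans.append((start, i))
            start = None
    return spans


def to_csv(lines):
    """Slice every line at the inferred spans and return the CSV text."""
    spans = field_spans(lines)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for line in lines:
        # Slicing past the end of a short line is safe; strip() removes the padding.
        writer.writerow([line[a:b].strip() for a, b in spans])
    return buf.getvalue()


sample = [
    "aaa  b b ccc      345",
    "ddd  fgt f u      3456",
    "e r  der der      5 674",
]
print(to_csv(sample), end="")
```

Compared with substituting characters in place, writing through `csv.writer` has the advantage that a field which itself contains a comma is quoted rather than silently split.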
