I am having trouble with several bits of code. I am no expert in Bash programming, unfortunately, and after trying unsuccessfully all day to find something that works for my task, I was hoping you could point me in the right direction.

I have many large files that I would like to split according to the third field in each of them. I would like to keep the header in each of the sub-files and save the sub-files in new directories named after the root names of the files.

The initial files stored in the original directory are:

Downloads/directory1/Levels_CHG_Lab_S_sample1.txt
Downloads/directory1/Levels_CHG_Lab_S_sample2.txt
Downloads/directory1/Levels_CHG_Lab_S_sample3.txt

and so on.

Each of these files has 200 columns, and column 3 contains values from 1 through 10. I would like to split each of the files above based on the value in this column and store the sub-files in sub-folders. For example, the sub-folder "Downloads/directory1/sample1" will contain 10 files (each with the header line) derived by splitting the file Downloads/directory1/Levels_CHG_Lab_S_sample1.txt.

I have now tried many different approaches to these steps, with no success. I must be making this more complicated than it is, since the code I have tried looks awful. Here is the code I am trying to work from:

FILES=Downloads/directory1/

for f in $FILES
  do
    # Create folder with root name by stripping file names
    fname=${echo $f | sed 's/.txt//;s/Levels_CHG_Lab_S_//'}
    echo "Creating sub-directory [$fname]"
    mkdir "$fname"

    # Save the header
    awk 'NR==1{print $0}' $f > header

    # Split each file by third column
    echo "Splitting file $f"
    awk  'NR>1  {print $0 > $3".txt" }' $f

    # Move newly created files in sub directory
    mv {1..10}.txt $fname  # I have no idea how to specify the files just created

    # Loop through the sub-files to attach header row:
    for subfile in $fname
      do
       cat header $subfile >> tmp_file
       mv -f tmp_file $subfile
      done
done

All these steps seem very complicated to me. I would very much appreciate it if you could help me solve this the right way. Thank you very much for your help. -fra

user971102

1 Answer

You have a few problems with your code right now. First of all, at no point do you list the contents of your downloads directory. You are simply setting the FILES variable to a string that is the path to that directory. You would need something like:

FILES=$(ls Downloads/directory1/*.txt)
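
To illustrate the difference, here is a small sketch that loops once over a plain directory string and once over a glob. The temporary directory and the two empty sample files are made up for the demonstration; they only stand in for Downloads/directory1. Looping over the glob directly is an alternative to capturing the output of ls, and it should also cope with file names that contain spaces.

```shell
# Hypothetical demo directory standing in for Downloads/directory1
demo=$(mktemp -d)
touch "$demo/Levels_CHG_Lab_S_sample1.txt" "$demo/Levels_CHG_Lab_S_sample2.txt"

# Looping over the directory string visits a single item: the directory itself
count_dir=0
for f in $demo; do
    count_dir=$((count_dir + 1))
done

# Looping over a glob visits every matching file
count_glob=0
for f in "$demo"/*.txt; do
    count_glob=$((count_glob + 1))
    echo "found: $f"
done

echo "directory string: $count_dir item(s), glob: $count_glob item(s)"
```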

You also never cd to the Downloads/directory1 folder, so your mkdir would create directories in cwd; probably not what you want.
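
As a rough sketch of one way around this without using cd, the target directory can be built from the path of the input file itself. The variable names name and outdir are only illustrative, and the path is the example from the question.

```shell
# Example path taken from the question
f="Downloads/directory1/Levels_CHG_Lab_S_sample1.txt"

name=$(basename "$f" .txt)        # drop the directory part and the .txt suffix
name=${name#Levels_CHG_Lab_S_}    # drop the common prefix
outdir="$(dirname "$f")/$name"    # place the new folder next to the input file

echo "$outdir"                    # prints Downloads/directory1/sample1
# mkdir -p "$outdir" would then create it under Downloads/directory1
# rather than directly in the current working directory
```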

If you know that the numbers in column 3 always range from 1 to 10, I would just pre-populate those files with the header line before you split the file.

Try this code to do what you want (untested):

BASEDIR=Downloads/directory1/
FILES=$(ls ${BASEDIR}/*.txt)

for f in $FILES; do
    # Create folder with root name by stripping file names
    dirname=$(basename "$f" .txt | sed 's/^Levels_CHG_Lab_S_//')
    dirname="${BASEDIR}/${dirname}/"
    echo "Creating sub-directory [$dirname]"
    mkdir -p "$dirname"

    # Save the header to each file
    HEADER_LINE=$(head -n1 "$f")
    for i in {1..10}; do
      echo "${HEADER_LINE}" > "${dirname}/${i}.txt"
    done

    # Split each file by third column
    echo "Splitting file $f"
    awk -v dirname="${dirname}" 'NR>1 {filename=dirname $3 ".txt"; print $0 >> filename }' "$f"
done
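
If the values in column 3 are not known in advance, a possible variation is to let awk write the header itself the first time it sees each value, so nothing has to be pre-populated. This is only a sketch: the function name split_by_col3 and the tiny three-column demo file are made up for illustration, and it assumes whitespace-separated fields.

```shell
split_by_col3() {
    # $1: input file, $2: output directory (created if missing)
    mkdir -p "$2"
    awk -v dir="$2" '
        NR == 1 { header = $0; next }     # remember the header line
        {
            out = dir "/" $3 ".txt"
            if (!(out in seen)) {         # first row for this value of column 3
                print header > out        # start the file with the header
                seen[out] = 1
            }
            print > out                   # awk keeps the file open, so later rows are appended
        }
    ' "$1"
}

# Hypothetical demo data: a header plus three rows, two of them with level 1
demo=$(mktemp -d)
printf 'id\tname\tlevel\n1\ta\t1\n2\tb\t2\n3\tc\t1\n' > "$demo/Levels_CHG_Lab_S_sample1.txt"
split_by_col3 "$demo/Levels_CHG_Lab_S_sample1.txt" "$demo/sample1"
ls "$demo/sample1"
```

Within one awk run, ">" only truncates a file the first time it is opened, which is why the header and the rows can share the same redirection here.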
ebarrere
  • Hi ebarrere, thank you very much for your help!! It looks like you understood exactly my problem, however there is still something odd going on when trying your steps: Finally your code gives me the 10 sub-files with the prefix "sample1", and a subfolder with the empty files with only headers, but it looks like the last awk command is not doing what was hoped, i.e. redirecting the subfiles to the subfolders, so it also doesn't append to the files with the header… I have tried with variations of this line filename=dirname$3".txt", but it doesn't seem to work..If you can spot why pls let me know – user971102 Jan 27 '14 at 01:07
  • At the awk line, if I use filename=dirname$3".txt" the files that have been split do not go to the sub-directory, while if I try to use a slash like this ${dirname}/$3 I get an error "expected newline or end of string" – user971102 Jan 27 '14 at 09:21
  • It looks like you need to add the trailing `/` to the `dirname` variable set in line 7, like this: `dirname="${BASENAME}/${dirname}/"`. Give that a try. – ebarrere Jan 27 '14 at 17:12
  • This doesn't work either…If I do this I get the Error: "mkdir: cannot create directory `//////////': File exists" – user971102 Jan 29 '14 at 16:27
  • Oh, it looks like I typo'd line 7 originally and never noticed. BASENAME should read BASEDIR as that's the variable we set. The full line should be `dirname="${BASEDIR}/${dirname}/"` – ebarrere Jan 30 '14 at 03:21
  • If you have more issues, try commenting out the "action lines" and echoing your variables at different parts of the script to make sure they're what you expect. See [this article](http://stackoverflow.com/questions/951336/how-to-debug-a-bash-script) for more tips on debugging bash scripts. Good luck! – ebarrere Jan 30 '14 at 03:26
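
In the spirit of the debugging suggestion above, one quick check is to print the path that awk would build instead of writing to it. The sketch below uses a made-up row whose third field is 7 and a made-up directory name sample1. It suggests why the split files showed up with the prefix "sample1" when the trailing slash was missing: awk simply concatenates the strings.

```shell
# Hypothetical row whose third field is 7
row='a b 7'

# Without a trailing slash the directory name is glued onto the file name
without_slash=$(echo "$row" | awk -v dirname="sample1"  '{ print dirname $3 ".txt" }')

# With a trailing slash the path points inside the directory
with_slash=$(echo "$row" | awk -v dirname="sample1/" '{ print dirname $3 ".txt" }')

echo "$without_slash"   # sample17.txt
echo "$with_slash"      # sample1/7.txt
```

Inside the single-quoted awk program a shell expression such as ${dirname}/$3 is not expanded by the shell, so a literal slash has to be written as a quoted string, for example dirname "/" $3 ".txt".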