
I have a two-column spreadsheet (saved in CSV format) that looks like this:

COLUMN 1,COLUMN 2
innovation,3-Entrepreneurship
countless,
innocent,2-Police
toilet handle,2-Bathroom
née dresses,3-Companies
odorless,2-Sense of Smell
old ideas,3-Entrepreneurship
new income streams,3-Entrepreneurship
Zoë’s food store,3-Companies
many,
crime,2-Police
bath room,2-Bathroom
ring,
móvíl résumés,3-Companies
musty smell,2-Sense of Smell
good publicity guru,3-Entrepreneurship
Señor,3-Companies

The full spreadsheet is 6000 rows (saved in CSV format, with commas separating the two columns), and it has more Column 2 categories than the ones listed here.

As shown, some of the column 1 entries consist of two or three words separated by spaces. They also make use of apostrophes and accented characters (these appear in multiple categories, not just the category titled 3-Companies).

You can download a sample of the full spreadsheet here.

I would like to split the CSV file into separate TXT files based on the values in column 2. The resulting files will no longer be spreadsheet tables, just lists of words.

E.g. after split

In file 3-Entrepreneurship.txt

innovation
old ideas
new income streams
good publicity guru

In file 2-Bathroom.txt

toilet handle
bath room

In file 2-Police.txt

innocent
crime

In file 2-Sense of Smell.txt

odorless
musty smell

In file 3-Companies.txt

née dresses
Zoë’s food store
móvíl résumés
Señor

This is only a sample. The full file has more than 5 categories (in column 2), so there will be more than 5 post-split files.


Here is my code so far:

awk -F "," '{print $0 >> ($2 ".txt"); close($2 ".txt")}' test.csv

It works, but there are a few issues:

  1. The file names are printed out with question marks at the end, e.g. 2-Police?.txt and 3-Entrepreneurship?.txt.

  2. The files still contain both columns, not just the words from column 1.

For example, 3-Entrepreneurship.txt contains the following data:

innovation,3-Entrepreneurship
old ideas,3-Entrepreneurship
new income streams,3-Entrepreneurship
good publicity guru,3-Entrepreneurship
big_smile
  • You should print the first field, not the whole line: `print $1 >>`. Your second issue is probably due to DOS line endings ("\r\n") in the input file. You may want to call `awk` with `awk -v RS='\r\n'` – M. Nejat Aydin Aug 18 '20 at 14:24
  • @M. Nejat Aydin that works, but it adds a blank row before every row. Here is the full code I am using `awk -v RS='\r\n' -F "," '{print $1 >> ($2 ".csv"); close($2 ".csv")}' test1.csv` – big_smile Aug 18 '20 at 14:39
  • I don't know. Your awk call works fine on my box. But beware that the lines with only one field will be collected in a file named `.csv`. – M. Nejat Aydin Aug 18 '20 at 14:49
  • @M. Nejat Aydin I am using Mac OS 10.14.6 if that makes a difference? – big_smile Aug 18 '20 at 14:51
  • You may try to call awk with `awk -v RS='\r\n' -v ORS='\r'` – M. Nejat Aydin Aug 18 '20 at 14:54
  • @M.NejatAydin Whoops, I just realized, I am supposed to output the files as TXT format (as a list of lines), rather than CSV, so my code would be `awk -v RS='\r\n' -F "," '{print $1 >> ($2 ".txt"); close($2 ".txt")}' test1.csv` (which works, it just puts a blank row before each row). When I try `awk -v RS='\r\n' ORS='\r -F "," '{print $1 >> ($2 ".txt"); close($2 ".txt")}' test1.csv`, it says `-bash: syntax error near unexpected token '('` – big_smile Aug 18 '20 at 14:58
  • You have typos in the command line. It should have been `awk -v RS='\r\n' -v ORS='\r'`... – M. Nejat Aydin Aug 18 '20 at 15:01
  • @M.NejatAydin using more than 1 char in the RS is undefined behavior per POSIX, so setting `RS='\r\n'` will do what you want in GNU awk and maybe some others, but in others still it'll behave as if you wrote `RS='\r'`, which is probably why the OP is seeing blank lines before each record: it's the `\n` that was supposed to be at the end of each line now being put at the start of the next line. – Ed Morton Aug 18 '20 at 23:54
  • @big_smile wrt `I am using Mac OS 10.14.6 if that makes a difference` yes it does. The default awk on MacOS is BSD awk and it only expects single-char RS values and truncates any multi-char RS to the first char and so will treat `RS='\r\n'` as if you wrote `RS='\r'`. If you install GNU awk instead then you can use multi-char RS and a ton of other very useful extensions. – Ed Morton Aug 19 '20 at 00:03
  • wrt `-F ","` - the input file in your question has no commas in it so telling awk that your input is comma-separated is just going to break your script. – Ed Morton Aug 19 '20 at 00:07
  • @Ed Morton is there any way I can use `RS='\r\n'` on Mac without needing to install new things? Thanks! Also, the input file is in CSV format, so it does have commas. I will update the OP to make that clearer. – big_smile Aug 20 '20 at 09:39
  • @EdMorton @M.NejatAydin Here is the final code so far: `awk -v RS='\r\n' -v ORS='\r' -F "," '{print $1 >> ($2 ".txt"); close($2 ".txt")}' test1.csv` It works fine, it just puts a blank line above each row (which I guess is because I am using Mac OS). In the chat (at https://chat.stackoverflow.com/rooms/220042/discussion-between-big-smile-and-m-nejat-aydin) it was suggested the input file could be the problem, so I have uploaded the input file here in case it sheds any light: http://www.filedropper.com/sample_150 – big_smile Aug 20 '20 at 09:46
  • wrt "is there any way" - yes, just use [the script I posted](https://stackoverflow.com/a/63478213/1745001). The `sub(/\r$/,"")` simply removes the `\r` at the end of each line and so the default RS works as-is. Again, setting `RS='\r\n'` is the wrong thing to do on MacOS. – Ed Morton Aug 20 '20 at 12:17

2 Answers

tail -n +2 file |
sort -t',' -k2 |
awk -F',' '$2~/^[[:space:]]*$/{next} {sub(/\r$/,"")} $2!=prev{close(out); out=$2".txt"; prev=$2} {print $1 > out}'

We sort the data first for efficiency so that awk doesn't have to keep opening/closing files line by line but instead only does that once per unique $2 value.
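
For readability, here is the same pipeline with the awk program spread over several lines and commented (functionally equivalent to the one-liner above):

# drop the header row, then group the rows by category so each output
# file only has to be opened and closed once
tail -n +2 file |
sort -t',' -k2 |
awk -F',' '
    $2 ~ /^[[:space:]]*$/ { next }   # skip rows with an empty column 2
    { sub(/\r$/, "") }               # strip a trailing CR left by DOS/Excel line endings
    $2 != prev {                     # the category changed: switch output files
        close(out)
        out = $2 ".txt"
        prev = $2
    }
    { print $1 > out }               # write only the word, not the whole line
'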

Here's the script working:

$ ls
file

$ cat file
COLUMN 1,COLUMN 2
innovation,3-Entrepreneurship
countless,
innocent,2-Police
toilet handle,2-Bathroom
née dresses,3-Companies
odorless,2-Sense of Smell
old ideas,3-Entrepreneurship
new income streams,3-Entrepreneurship
Zoë’s food store,3-Companies
many,
crime,2-Police
bath room,2-Bathroom
ring,
móvíl résumés,3-Companies
musty smell,2-Sense of Smell
good publicity guru,3-Entrepreneurship
Señor,3-Companies


$ tail -n +2 file | sort -t',' -k2 | awk -F',' '$2~/^[[:space:]]*$/{next} {sub(/\r$/,"")} $2!=prev{close(out); out=$2".txt"; prev=$2} {print $1 > out}'


$ ls
 2-Bathroom.txt   2-Police.txt  '2-Sense of Smell.txt'   3-Companies.txt   3-Entrepreneurship.txt   file

$ head -n 50 *.txt
==> 2-Bathroom.txt <==
bath room
toilet handle

==> 2-Police.txt <==
crime
innocent

==> 2-Sense of Smell.txt <==
musty smell
odorless

==> 3-Companies.txt <==
móvíl résumés
née dresses
Señor
Zoë’s food store

==> 3-Entrepreneurship.txt <==
good publicity guru
innovation
new income streams
old ideas
Ed Morton

You are very close.

First, it seems that the words without a value in column 2 should not be written to a file. If so, add a test to your action. The test can be just $2 which will skip the action inside the { ... } if $2 is blank.

Second, if you want only the word, write only column 1 ($1) to the file -- not the entire line ($0).

With your example as originally posted (whitespace-delimited rather than comma-delimited), you can do:

awk '$2 {print $1 >> ($2 ".txt"); close($2 ".txt")}' file

Be sure to delete the .txt files first or run in an empty directory, since this appends to the files if they already exist.

Produces:

2-Bathroom.txt:
toilet
bath

2-Police.txt:
innocent
crime

2-Sense.txt:
odorless
musty

3-Entrepreneurship.txt:
innovation
ideas
income
publicity
dawg
  • This sort of works, except `,3-Entrepreneurship.txt` is split into two files. The first file contains "innovation", "ideas" and "income", while the second file (`,3-Entrepreneurship?.txt`) contains only "publicity". Also, all file names have a comma before them. – big_smile Aug 20 '20 at 09:48
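
As that last comment shows, the real file is comma-separated and has CRLF line endings, so this approach needs a comma field separator, a CR strip, and an explicit test on the second field. A minimal sketch of that adaptation (essentially the portable one-liner shown after the question's comments):

awk -F',' '{ sub(/\r$/, "") } $2 != "" { print $1 >> ($2 ".txt"); close($2 ".txt") }' file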