I have a two column spreadsheet (saved in CSV format) that is like this:
COLUMN 1,COLUMN 2
innovation,3-Entrepreneurship
countless,
innocent,2-Police
toilet handle,2-Bathroom
née dresses,3-Companies
odorless,2-Sense of Smell
old ideas,3-Entrepreneurship
new income streams,3-Entrepreneurship
Zoë’s food store,3-Companies
many,
crime,2-Police
bath room,2-Bathroom
ring,
móvíl résumés,3-Companies
musty smell,2-Sense of Smell
good publicity guru,3-Entrepreneurship
Señor,3-Companies
The full spreadsheet is 6000 rows (saved in CSV format, with commas used to separate the two columns). It has more than the Column 2 categories that are listed here.
As shown, some of the column 1 entries consists of two- or three- words separated by a space. They also make use of apostrophes and accented characters (these appear in multiple categories and not just the category titled 3-Companies
).
You can download a sample of the full spreadsheet here.
I would like to split the CSV file up in to separate TXT files by the values in column 2. The separate files will no longer be a spreadsheet table, but just a list of words.
E.g. after split
In file 3-Entrepreneurship.txt
innovation
old ideas
new income streams
good publicity guru
In file 2-Bathroom.txt
toilet handle
bath room
In file 2-Police.txt
innocent
crime
In file 2-Sense of Smell.txt
odorless
musty smell
In file 3-Companies.txt
née dresses
Zoë’s food store
móvíl résumés
Señor
This is only a sample. The full file has more than 5 categories (in column 2), so there will be more than 5 post-split files.
Here is my code so far:
awk -F "," '{print $0 >> ($2 ".txt"); close($2 ".txt")}' test.csv
It works, but there are a few issues:
The file names are printed out with question marks at the end. E.g.
2-Police?.txt
3-Entrepreneurship?.txt
The files print out with two columns.
For example inside of 3-Entrepreneurship.txt it has the following data:
innovation 3-Entrepreneurship
ideas 3-Entrepreneurship
income 3-Entrepreneurship
publicity 3-Entrepreneurship