How can I convert line-wise frequency distributions from multiple TXT files into a single matrix? All files share exactly the same structure: every word/term/phrase appears in every file, in the same order. Unique to each file are the filename, an issue date, and the frequency of each word/term/phrase, given as a number after the ":". See the following:
What my input files look like (two example files):
Company ABC-GH Date:31.12.2012
financial statement:4
corporate-taxes:8
assets:2
available-for-sale property:0
auditors:213

123-Company XYZ Date:31.12.2012
financial statement:15
corporate-taxes:3
assets:8
available-for-sale property:2
auditors:23
I have multiple files that all list the words/phrases in exactly the same order and differ only in the frequencies (the numbers after ":").
Now I want to create a single file containing a matrix that keeps all words as the header row and attaches each file's characteristics (filename, date, and frequencies) as comma-separated row entries for further processing. That is, if the term after the third comma (the fourth column) is "corporate-taxes", then the fourth entry of every row should be the frequency of that term in the corresponding document.
Desired Output:
Filename, Date, financial statement, corporate-taxes, ..., auditors
Company ABC-GH, 31.12.2012, 4, 8, 213
123-Company XYZ, 31.12.2012, 15, 3, 23
At the end I want to write the outcome to a TXT file. Do you have an idea?
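To make the question concrete, here is a rough sketch of what I imagine the solution could look like (a minimal Python sketch; the glob pattern "reports/*.txt", the output name "matrix.txt", and the function names are just placeholders, and I am assuming each file's first line carries the company name followed by "Date:" and the date, as in the samples above):

import csv
import glob

def parse_file(path):
    """Parse one report file into (name, date, ordered (term, count) pairs)."""
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    # First line, e.g. "Company ABC-GH Date:31.12.2012"
    name, _, date = lines[0].rpartition("Date:")
    pairs = []
    for line in lines[1:]:
        # Split on the last ":" so the count is always the final field
        term, _, count = line.rpartition(":")
        pairs.append((term.strip(), int(count)))
    return name.strip(), date.strip(), pairs

def build_matrix(paths, out_path):
    header = None
    rows = []
    for path in paths:
        name, date, pairs = parse_file(path)
        if header is None:
            # Terms appear in the same order in every file,
            # so take the column order from the first file
            header = ["Filename", "Date"] + [term for term, _ in pairs]
        rows.append([name, date] + [count for _, count in pairs])
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

build_matrix(sorted(glob.glob("reports/*.txt")), "matrix.txt")

This relies on the term order being identical in every file, so the column order can simply be taken from the first file and the remaining files only contribute their frequency columns.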