
How can I convert line-wise frequency distributions from multiple TXT files into a single matrix? All files have exactly the same structure: every word/term/phrase is contained in every file, in the same order. Unique to each file are the filename, an issue date and the respective frequency of each word/term/phrase, given by the number after the ":". See the following:

What my input files look like:

Company ABC-GH Date:31.12.2012
financial statement:4
corporate-taxes:8
assets:2
available-for-sale property:0
auditors:213

123-Company XYZ Date:31.12.2012
financial statement:15
corporate-taxes:3
assets:8
available-for-sale property:2
auditors:23

I have multiple files with the exact same order of words/phrases; they only differ in the frequencies (the numbers after the ":").

Now I want to create a single file containing a matrix that has all words as the top row (column headers) and one comma-separated row per input file with that file's characteristics (filename, date and frequencies), so that I can process it further; i.e. if the term after the 3rd comma (fourth column) is "corporate-taxes", then the fourth entry of each row should be the frequency of that term in the corresponding document.

Desired Output:

Filename,        Date, financial statement, corporate-taxes, .., auditors
COMPANY ABC-GH,  2008,                  15,               3, ..,      23
123-COMPANY XYZ, 2010,                   9,               6, ..,      11

At the end I want to write the outcome to a TXT file. Do you have an idea?

Dominik Scheld

1 Answer


Say that you have a list of files

lof = ['a1.txt', 'a2.txt', 'b1.txt']
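(If you don't want to type the file names by hand, you could build that list with the glob module; the pattern below is just an assumption, adjust it to wherever your frequency files actually live.)

import glob

# hypothetical pattern: all input files end in .txt and sit in the current directory
lof = sorted(glob.glob('*.txt'))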

you can initialize your result to an empty list

res = []

and then, for each input file, extend the result with the rows produced by a list comprehension

for f in lof:
    res += [[entry.split(':')[1] for entry in cdata ]
             for cdata in [data.splitlines() for data in open(f).read().split('\n\n')]]
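(In case the nested comprehension is hard to read, here is a plain, step-by-step loop that does the same thing; just a sketch of the logic above.)

res = []
for f in lof:
    # blocks are separated by a blank line, one block per company
    for block in open(f).read().split('\n\n'):
        cdata = block.splitlines()                            # one string per line
        res.append([entry.split(':')[1] for entry in cdata])  # keep what follows ':'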

Let's look at the inner part of the comprehension, for a file 'ex.txt' that has the same content as your example

In [44]: [d.splitlines() for d in open('ex.txt').read().split('\n\n')]
Out[44]: 
[['Company ABC-GH Date:31.12.2012',
  'financial statement:4',
  'corporate-taxes:8',
  'assets:2',
  'available-for-sale property:0',
  'auditors:213'],
 ['123-Company XYZ Date:31.12.2012',
  'financial statement:15',
  'corporate-taxes:3',
  'assets:8',
  'available-for-sale property:2',
  'auditors:23']]

What is each cdata in the outer part of the comprehension?

In [45]: for cdata in [d.splitlines() for d in open('ex.txt').read().split('\n\n')]:
   ....:     print cdata
   ....:     
['Company ABC-GH Date:31.12.2012', 'financial statement:4', 'corporate-taxes:8', 'assets:2', 'available-for-sale property:0', 'auditors:213']
['123-Company XYZ Date:31.12.2012', 'financial statement:15', 'corporate-taxes:3', 'assets:8', 'available-for-sale property:2', 'auditors:23']

For each cdata (i.e., company data) we want a list with only the part after the ':', so we split each entry on ':' and keep only the element at index 1

In [46]: [[entry.split(':')[1] for entry in cdata]]
Out[46]: [['31.12.2012', '15', '3', '8', '2', '23']]

It's just a matter of putting it all together in a single list comprehension

In [47]: [[entry.split(':')[1] for entry in comp_data] for comp_data in [data.splitlines() for data in open('ex.txt').read().split('\n\n')]]
Out[47]: 
[['31.12.2012', '4', '8', '2', '0', '213'],
 ['31.12.2012', '15', '3', '8', '2', '23']]

and of putting it into the loop that I showed before, accumulating the results for all your input files
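As for writing the outcome to a TXT file, here is a rough, untested sketch: 'matrix.txt' is just a placeholder name, and it assumes (as you said) that every file lists the same terms in the same order, so that the column names can be taken from the first block of the first file.

# the first column of every row produced above is the date, hence the 'Date' label
first_block = open(lof[0]).read().split('\n\n')[0].splitlines()
header = ['Date'] + [entry.split(':')[0] for entry in first_block[1:]]

with open('matrix.txt', 'w') as out:
    out.write(','.join(header) + '\n')   # Date,financial statement,corporate-taxes,...
    for row in res:
        out.write(','.join(row) + '\n')  # one comma separated row per company block

Prepending the filename to each row, as in your desired output, would need an extra step while looping over lof.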

gboffi
  • thanks a lot for your help, that is almost what I need. I have 3 more questions, which I hope you can answer as well: 1) Is there a way to begin the result list with the first frequency (i.e. not the date "31.12.2012")? 2) When I write "res" to a file like outfile.write(str(res)), is there a way to have each company file, i.e. each inner list element, in a separate row ['4', '8', '2', '0', '213' \n '15', '3', '8', '2', '23' \n]? 3) Each filename in the list of files ("lof") is structured: CompanyName-SerialNumber:IssueDate_IFRS.txt. Can I extract that information for each row before the frequencies? – Dominik Scheld Apr 05 '15 at 11:20
  • my desired output would then look like: 'CompanyNameA','SerialNumberA','IssueDateA','FrequencyA1','FrequencyA2',...'FrequencyAN' \newline 'CompanyNameB','SerialNumberB','IssueDateB','FrequencyB1','FrequencyB2',...'FrequencyBN' \newline ... – Dominik Scheld Apr 05 '15 at 11:23
  • Dominik, I tried to answer what you asked for... or at least what I thought you've asked... Now you realize that your problem requires a different solution and you give a new explanation of your requirements (that, I'm sorry, I cannot understand). What am I supposed to do? Please, try to ask a new question in which you clearly show two or three data files, their names, some code that you have written and a correct example of the required output, and it is possible that someone can help you. – gboffi Apr 05 '15 at 16:26
  • thanks gboffi, I tried to better formulate what I intend, hope you do understand my question now, your solution is already pretty close. Here is the link to the updated question: http://stackoverflow.com/questions/29467731/comma-separated-matrix-from-txt-files-continued – Dominik Scheld Apr 06 '15 at 08:21