How to extract specific information from many text files

Question

I have over 200 .txt files; one of them is shown below as an example. I want to read them one by one, take specific information from each, and export it to an XLS file.

As an example, how can I get the following information into an XLS file?

     TOTAL ENERGY            =       -444.38126 EV
      ELECTRONIC ENERGY       =       -840.31531 EV
      CORE-CORE REPULSION     =        395.93406 EV
      GRADIENT NORM           =          0.91931 = 0.45965 PER ATOM
      DIPOLE                  =          2.66600 DEBYE    POINT GROUP:       C2v 
      NO. OF FILLED LEVELS    =          6
      IONIZATION POTENTIAL    =         10.352991 EV
      HOMO LUMO ENERGIES (EV) =        -10.353  0.402
      MOLECULAR WEIGHT        =         30.0262
      COSMO AREA              =         60.70 SQUARE ANGSTROMS
      COSMO VOLUME            =         42.52 CUBIC ANGSTROMS

I read a few posts that suggest using

sed -n ".." file.txt

The problem is that even with that it would take me very long, because I would have to read one file at a time in bash and then search for each keyword, like

          HEAT OF FORMATION 
          TOTAL ENERGY   
          ELECTRONIC ENERGY     
          CORE-CORE REPULSION  
          GRADIENT NORM        
          DIPOLE               
          NO. OF FILLED LEVELS   
          IONIZATION POTENTIAL   
          HOMO LUMO ENERGIES (EV) 
          MOLECULAR WEIGHT        
          COSMO AREA              
          COSMO VOLUME

Then I paste the lines one by one into an XLS file along with their corresponding values. One complete file looks like this:

                     SUMMARY OF PM7 CALCULATION, Site No: 29451

                                                       MOPAC2016 (Version: 18.063M)
                                                       Tue Mar 20 15:08:13 2018
                                                       No. of days remaining = 349

           Empirical Formula: C H2 O  =     4 atoms

 SYMMETRY
 Formaldehyde



     GEOMETRY OPTIMISED USING EIGENVECTOR FOLLOWING (EF).     
     SCF FIELD WAS ACHIEVED                                   

          HEAT OF FORMATION       =        -25.54241 KCAL/MOL =    -106.86944 KJ/MOL
          TOTAL ENERGY            =       -444.38126 EV
          ELECTRONIC ENERGY       =       -840.31531 EV
          CORE-CORE REPULSION     =        395.93406 EV
          GRADIENT NORM           =          0.91931 = 0.45965 PER ATOM
          DIPOLE                  =          2.66600 DEBYE    POINT GROUP:       C2v 
          NO. OF FILLED LEVELS    =          6
          IONIZATION POTENTIAL    =         10.352991 EV
          HOMO LUMO ENERGIES (EV) =        -10.353  0.402
          MOLECULAR WEIGHT        =         30.0262
          COSMO AREA              =         60.70 SQUARE ANGSTROMS
          COSMO VOLUME            =         42.52 CUBIC ANGSTROMS

          MOLECULAR DIMENSIONS (Angstroms)

            Atom       Atom       Distance
            H     3    O     1     2.00299
            H     4    O     1     1.65067
            H     4    C     2     0.00000
          SCF CALCULATIONS        =          4
          WALL-CLOCK TIME         =          0.309 SECONDS
          COMPUTATION TIME        =          0.033 SECONDS


          FINAL GEOMETRY OBTAINED
 SYMMETRY
 Formaldehyde

  O     0.00000000 +0    0.0000000 +0    0.0000000 +0     0     0     0
  C     1.20614565 +1    0.0000000 +0    0.0000000 +0     1     0     0
  H     1.09115836 +1  121.2760970 +1    0.0000000 +0     2     1     0
  H     1.09115836 +0  121.2760970 +0  180.0000000 +0     2     1     3

   3  1    4
   3  2    4

I want to export the data to one CSV file, with the values listed under each other, like below

data1
-444.38126 EV
-840.31531 EV
395.93406 EV
0.91931 = 0.45965 PER ATOM
    2.66600 
    C2v 
    6
      10.352991
   -10.353  0.402
   30.0262
   60.70  
  42.52

I know how to read each of the files line by line. Let's assume the output file is output.txt

line_num=0
text=File.open('output.txt').read
text.gsub!(/\r\n?/, "\n")
text.each_line do |line|
  print "#{line_num += 1} #{line}"
end

So reading it line by line works. Now I try to extract that information:

# Labels to look for, exactly as they appear to the left of the "=" sign.
keywords = [
  "TOTAL ENERGY", "ELECTRONIC ENERGY", "CORE-CORE REPULSION",
  "GRADIENT NORM", "DIPOLE", "NO. OF FILLED LEVELS",
  "IONIZATION POTENTIAL", "HOMO LUMO ENERGIES (EV)",
  "MOLECULAR WEIGHT", "COSMO AREA", "COSMO VOLUME"
]

text = File.read('output.txt')
text.gsub!(/\r\n?/, "\n")
text.each_line do |line|
  # Split on the first "=" only, so "0.91931 = 0.45965 PER ATOM" stays whole.
  label, value = line.split("=", 2)
  next if value.nil?
  # Comparing whole labels avoids regex pitfalls such as the unescaped
  # parentheses in /HOMO LUMO ENERGIES (EV)/.
  puts value.strip if keywords.include?(label.strip)
end
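
A minimal sketch of how a single summary line could be taken apart, assuming every line of interest has the form `LABEL = VALUE`. The helper names `parse_summary_line` and `leading_numbers` are hypothetical, and dropping the trailing unit is an assumption based on the desired output shown above:

```ruby
# Hypothetical helper, not part of the original post.
# Splits "LABEL = VALUE" on the first "=" only, so a value that itself
# contains "=" (as on the GRADIENT NORM line) stays in one piece.
# Returns nil for lines that contain no "=".
def parse_summary_line(line)
  label, value = line.split("=", 2)
  return nil if value.nil?

  [label.strip, value.strip]
end

# Assumption: only the leading number(s) are wanted and the trailing unit
# (EV, DEBYE, ...) can be dropped. Returns nil if the value does not start
# with a number.
def leading_numbers(value)
  value[/\A-?\d+(?:\.\d+)?(?:\s+-?\d+(?:\.\d+)?)*/]
end
```

With this, the HOMO LUMO line keeps both of its numbers, while the GRADIENT NORM line is reduced to its first number only. The point group on the DIPOLE line would need separate handling.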

Well, you can start with this: [What are all the common ways to read a file in Ruby?](https://stackoverflow.com/q/5545068/125816) — Sergio Tulentsev, Mar 23 '18 at 14:09
@Sergio Tulentsev not informative. I know how to read one file line by line . look at above. The problem is how to extract the data from one of these files and then how to make it through over 200 files ? — , Mar 23 '18 at 14:13
What do you mean? You scan the file line by line until you see a line that matches one of your keywords. You then extract the info and store it. When you run out of files, you aggregate and print the results. — Sergio Tulentsev, Mar 23 '18 at 14:20
@Sergio Tulentsev for you it is very easy :-))) for me had been a nightmare , I could not do that — , Mar 23 '18 at 14:21
This can be broken into several smaller questions. How to find a line in a file (that satisfies a condition)? How to read several files, line by line, one after another? How to extract this specific value from a line in this specific format? And so on. As it is at the moment, the question is, essentially, "do this work for me (or most of it, at least)". But if you break the problem down to small pieces, most of them will be googlable, and we can help with the rest. — Sergio Tulentsev, Mar 23 '18 at 14:24
@Sergio Tulentsev I understand what you mean. I tried my best, please look above, I put what I have done, If i could do that I really would. — , Mar 23 '18 at 14:28
I don't have the time for a full answer, but you probably want to look into `Dir.glob` for a full list of the files in a directory. Which you can then`File.open` on each... `Dir.glob('files/*').each { |path| file = File.open(path) }`. Hope this helps — AJFaraday, Mar 23 '18 at 14:33
Ah, better now! I understand the frustration, but this question _is still_ too broad. If I were you, I'd use the advice above (breaking a big problem into smaller _independent_ sub-problems, which are not as overwhelming). — Sergio Tulentsev, Mar 23 '18 at 14:40
As an aside, wouldn't it make more sense for the exported CSV to have one column for each field name ("HEAT OF FORMATION," "TOTAL ENERGY", etc.) and one row for each file? Surely this would make using the data in Excel easier. — Jordan Running, Mar 23 '18 at 14:53
@Jordan Running sure, for me it does not matter so much , the problem is how to extract them :-D — , Mar 23 '18 at 14:56
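
The steps suggested in the comments above (list the files, scan each one for the keywords, collect the values, write one row per file) could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `FIELDS`, `extract_fields` and `summarize` are made up here, the `files/*.txt` path and the `summary.csv` name are hypothetical, and every file is assumed to use the `LABEL = VALUE` layout shown in the question.

```ruby
require "csv"

# Labels copied from the question. Matching them exactly against the text
# left of "=" is an assumption about the file layout.
FIELDS = [
  "HEAT OF FORMATION", "TOTAL ENERGY", "ELECTRONIC ENERGY",
  "CORE-CORE REPULSION", "GRADIENT NORM", "DIPOLE",
  "NO. OF FILLED LEVELS", "IONIZATION POTENTIAL",
  "HOMO LUMO ENERGIES (EV)", "MOLECULAR WEIGHT",
  "COSMO AREA", "COSMO VOLUME"
].freeze

# Returns a hash of label => raw value text for one file's contents.
def extract_fields(text)
  result = {}
  text.each_line do |line|
    label, value = line.split("=", 2)
    next if value.nil?

    label = label.strip
    result[label] = value.strip if FIELDS.include?(label)
  end
  result
end

# Writes one CSV row per input file and one column per field.
# Fields missing from a file are left empty.
def summarize(paths, csv_path)
  CSV.open(csv_path, "w") do |csv|
    csv << ["file", *FIELDS]
    paths.each do |path|
      fields = extract_fields(File.read(path))
      csv << [File.basename(path), *FIELDS.map { |f| fields[f] }]
    end
  end
end

# Example call (hypothetical paths):
# summarize(Dir.glob("files/*.txt").sort, "summary.csv")
```

The values are kept as raw text, units included, so any further clean-up can happen either in Ruby or later in Excel.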

score 0 · Answer 1 · answered Mar 23 '18 at 16:16

Does it have to be in Ruby? How about reading the files with bash and then formatting the result in Excel?

For example:

for filename in *.txt; do
    awk '{print FILENAME ":" $0}' "$filename" | grep '[A-Z]\{3,\}.*=' >> r.csv
done

This will create an r.csv file, which you can open in Excel and format using the menu Data -> Text to Columns.

Then you could use the character "=" as the column separator, for example.
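
If a pure-Ruby route is preferred, the same filter could be sketched along these lines. This is a rough equivalent rather than a tested replacement: `PATTERN` and `matching_lines` are names invented here, and the regular expression simply mirrors the grep pattern above (a run of three or more capital letters followed later by "=").

```ruby
# Mirrors the grep pattern '[A-Z]\{3,\}.*=' from the loop above.
PATTERN = /[A-Z]{3,}.*=/

# Returns the matching lines of one file's text, each prefixed with the
# file name and a colon, as the awk step does.
def matching_lines(name, text)
  text.each_line
      .select { |line| line.match?(PATTERN) }
      .map { |line| "#{name}:#{line.chomp}" }
end

# Example call (writes r.csv in the current directory):
# File.open("r.csv", "w") do |out|
#   Dir.glob("*.txt").sort.each { |f| out.puts matching_lines(f, File.read(f)) }
# end
```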

lhdv


it gives error Referenced from: /usr/local/bin/awk Reason: image not found dyld: Library not loaded: /usr/local/opt/mpfr/lib/libmpfr.4.dylib Referenced from: /usr/local/bin/awk Reason: image not found – Mar 23 '18 at 16:55
awk is only there to show the name of the file being read. You can try replacing it with cat: `cat $filename | grep '[A-Z]\{3,\}.*=' >> r.csv` – lhdv Mar 23 '18 at 17:15
does not give me `cat: {print FILENAME ":" $0}: No such file or directory cat: {print FILENAME ":" $0}: No such file or directory` – Mar 23 '18 at 17:17
running only this `cat $filename | grep '[A-Z]\{3,\}.*=' >> r.csv` will generate the .csv but there is nothing in there – Mar 23 '18 at 17:19
