0

My research data is not a normal file. The column number of each row may be different. I want to convert the "dat" file into "*.csv" file using Perl because it can run efficiently (I have poor experience in Python). Below is the script I ran. It works fine in Linux, but it does not output the context in Python scripts.

new_dat_file="14689_bondlength.prmfrm.dat"
new_csv_file="14689_bondlength.prmfrm.csv"
perl -p -e 's/\\t\s+|\s+/,/g' $new_dat_file |perl -p -e 's/,FRAM/\\nFRAM/g' >  $new_csv_file

Above codes work in Linux. I improved them in Python scripts. If I use below command

import os
new_dat_file="14689_bondlength.prmfrm.dat"
new_csv_file="14689_bondlength.prmfrm.csv"
cmd = "perl -p -e 's/\\t\s+|\s+/,/g' " + new_dat_file + " |perl -p -e 's/,FRAM/\\nFRAM/g' 
os.system(cmd)

I can see the output correctly. However, if I modify cmd values like below to save the output into a csv file, but I got an empty file.

cmd = "perl -p -e 's/\\t\s+|\s+/,/g' " + new_dat_file + " |perl -p -e 's/,FRAM/\\nFRAM/g' > " + new_csv_file

How can I fix this issue? If I cannot do so, any other alternative way available? Any further suggestion would be highly appreciated.

BTW: partial data listed below

FRAM_#            0            0(fs)  CN= 1 PRMRYTGT     14689      H      15449      O  1.008
FRAM_#          100           25(fs)  CN= 1 PRMRYTGT     14689      H      15449      O  0.955
FRAM_#          200           50(fs)  CN= 1 PRMRYTGT     14689      H      15449      O  0.993
FRAM_#          300           75(fs)  CN= 1 PRMRYTGT     14689      H      15449      O  0.973
FRAM_#          400          100(fs)  CN= 1 PRMRYTGT     14689      H      15449      O  0.988
FRAM_#          500          125(fs)  CN= 1 PRMRYTGT     14689      H      15449      O  1.033
FRAM_#          600          150(fs)  CN= 1 PRMRYTGT     14689      H      15449      O  1.032
FRAM_#          700          175(fs)  CN= 1 PRMRYTGT     14689      H      15449      O  0.986
FRAM_#          800          200(fs)  CN= 1 PRMRYTGT     14689      H      15449      O  1.061
FRAM_#          900          225(fs)  CN= 1 PRMRYTGT     14689      H      15449      O  1.078
FRAM_#         1000          250(fs)  CN= 1 PRMRYTGT     14689      H      15449      O  0.922
FRAM_#         1100          275(fs)  CN= 2 PRMRYTGT     14689      H      17402      O  1.257     15449      O  1.430
FRAM_#       303200        75800(fs)  CN= 0 PRMRYTGT_BD     14689      H
FRAM_#       921200       230300(fs)  CN= 1 PRMRYTGT_BD     14689      H        8375      O  1.062
FRAM_#      1078700       269675(fs)  CN= 1 PRMRYTGT_BD     14689      H       12971      O  1.507
FRAM_#     18203400      4550850(fs)  CN= 1 PRMRYTGT_BD     14689      H       16172      O  1.507

I hope to get output like below:

FRAM_#,0,0(fs),CN=,1,PRMRYTGT,14689,H,15449,O,1.008
FRAM_#,100,25(fs),CN=,1,PRMRYTGT,14689,H,15449,O,0.955
FRAM_#,200,50(fs),CN=,1,PRMRYTGT,14689,H,15449,O,0.993
FRAM_#,300,75(fs),CN=,1,PRMRYTGT,14689,H,15449,O,0.973
FRAM_#,400,100(fs),CN=,1,PRMRYTGT,14689,H,15449,O,0.988
FRAM_#,500,125(fs),CN=,1,PRMRYTGT,14689,H,15449,O,1.033
FRAM_#,600,150(fs),CN=,1,PRMRYTGT,14689,H,15449,O,1.032
FRAM_#,700,175(fs),CN=,1,PRMRYTGT,14689,H,15449,O,0.986
FRAM_#,800,200(fs),CN=,1,PRMRYTGT,14689,H,15449,O,1.061
FRAM_#,900,225(fs),CN=,1,PRMRYTGT,14689,H,15449,O,1.078
FRAM_#,1000,250(fs),CN=,1,PRMRYTGT,14689,H,15449,O,0.922
FRAM_#,1100,275(fs),CN=,2,PRMRYTGT,14689,H,17402,O,1.257,15449,O,1.430
Leon
  • 444
  • 2
  • 15

3 Answers3

2

I'm afraid your Perl scripting skills are a bit limited, too. Anyway, this is all easy enough to do natively in Python.

import re

new_dat_file = "14689_bondlength.prmfrm.dat"
new_csv_file = "14689_bondlength.prmfrm.csv"
with open(new_dat_file) as input, open(new_csv_file, 'w') as output:
    for line in input:
        line = line.rstrip('\n')
        line = re.sub(r'\\t\s+|\s+', ',', line)
        # line = line.replace(',FRAM', r'\\nFRAM')
        output.write(line + '\n')

The double backslashes look fishy, but I assume that's really what you actually want. \\t is a backslash and a lowercase t, whereas \t represents a tab.

I updated this to chomp off the final newline, then put it back after the regex replace. With that, there's no need to put a newline before FRAM.

If you really, really wanted to use an external process, I'd go with

import subprocess

with open(new_dat_file) as input, open(new_csv_file, 'w') as output:
    subprocess.run(['perl', '-p', '-e', 's/\\t\s+|\s+/,/g; s/,FRAM/\\nFRAM/g'],
    stidin=input, stdout=output, check=True)

and maybe similarly to above, switch to perl -l to avoid having the newlines replaced so you have to put them back again.

tripleee
  • 175,061
  • 34
  • 275
  • 318
  • I can't tell if that's right or wrong. Posting this in a comment doesn't work very well; but perhaps you can articulate what should be different? – tripleee Feb 16 '20 at 17:32
  • I hope that the newline starts from FRAM. Please refer my question I updated. – Leon Feb 16 '20 at 17:33
  • Not blindly replacing `\s` with `\n` in the first place. would seem like absolutely the simplest fix. – tripleee Feb 16 '20 at 17:35
  • You have the same bug in your Perl script, and really don't need to invoke Perl twice to perform two substitutions. – tripleee Feb 16 '20 at 17:39
2

This can be easily done in perl. Recall \s includes the \n\t\r. You need \h which is any horizontal space and does not include line feed

just do

 perl -pe 's/\h+/,/g' $new_dat_file > $new_csv_file
Onyambu
  • 67,392
  • 3
  • 24
  • 53
  • yes it is true to run it in Perl. However, I am learning Python to use the ML library in Python. So I have to convert my research data format to the one that can be accepted by Python. – Leon Feb 16 '20 at 17:47
  • @Leon. This is also readable in python. Actually both the `.dat` and `.csv` are readable in python. The code you did has nothing associated with ML. you are just transforming data from one format to the other using python. – Onyambu Feb 16 '20 at 17:49
  • I import the data to Pandas or Scikit-learn. There are error messages that drove me to find a curved way. Any suggestions would be highly appreciated. Please refer my another question. https://stackoverflow.com/questions/60243545/how-to-import-data-file-with-different-column-numbers-using-panda-or-scipy?noredirect=1#comment106574159_60243545 – Leon Feb 16 '20 at 17:53
-1

This task is very simple - replace all sequential spaces [ +] to comma [,] -- done.

use strict;
use warnings;
use feature 'say';

my $infile  = shift || die 'Provide input file';
my $outfile = shift || die 'Provide output file';

my $fh;     # filehandle

open $fh, '<', $infile
    or die "Couldn't open $infile";

my @data = <$fh>;

close $fh;

open $fh, '>', $outfile
    or die "Couldn't open $outfile";

for (@data) {
    chomp;
    s/ +/,/g;
    say $fh $_;
}

close $fh;

Output

FRAM_#,0,0(fs),CN=,1,PRMRYTGT,14689,H,15449,O,1.008
FRAM_#,100,25(fs),CN=,1,PRMRYTGT,14689,H,15449,O,0.955
FRAM_#,200,50(fs),CN=,1,PRMRYTGT,14689,H,15449,O,0.993
FRAM_#,300,75(fs),CN=,1,PRMRYTGT,14689,H,15449,O,0.973
FRAM_#,400,100(fs),CN=,1,PRMRYTGT,14689,H,15449,O,0.988
FRAM_#,500,125(fs),CN=,1,PRMRYTGT,14689,H,15449,O,1.033
FRAM_#,600,150(fs),CN=,1,PRMRYTGT,14689,H,15449,O,1.032
FRAM_#,700,175(fs),CN=,1,PRMRYTGT,14689,H,15449,O,0.986
FRAM_#,800,200(fs),CN=,1,PRMRYTGT,14689,H,15449,O,1.061
FRAM_#,900,225(fs),CN=,1,PRMRYTGT,14689,H,15449,O,1.078
FRAM_#,1000,250(fs),CN=,1,PRMRYTGT,14689,H,15449,O,0.922
FRAM_#,1100,275(fs),CN=,2,PRMRYTGT,14689,H,17402,O,1.257,15449,O,1.430
FRAM_#,303200,75800(fs),CN=,0,PRMRYTGT_BD,14689,H
FRAM_#,921200,230300(fs),CN=,1,PRMRYTGT_BD,14689,H,8375,O,1.062
FRAM_#,1078700,269675(fs),CN=,1,PRMRYTGT_BD,14689,H,12971,O,1.507
FRAM_#,18203400,4550850(fs),CN=,1,PRMRYTGT_BD,14689,H,16172,O,1.507
Polar Bear
  • 6,762
  • 1
  • 5
  • 12