
Each of my research data files ("*.dat") has up to 2,000,000 data lines. The number of columns can differ from line to line. Below is an example.

FRAM_#            0            0(fs)  CN= 1 PRMRYTGT     14689      H      15449      O  1.008
FRAM_#         1100          275(fs)  CN= 2 PRMRYTGT     14689      H      17402      O  1.257     15449      O  1.430
FRAM_#       303200        75800(fs)  CN= 0 PRMRYTGT_BD     14689      H
FRAM_#       921200       230300(fs)  CN= 1 PRMRYTGT_BD     14689      H        8375      O  1.062
FRAM_#      1078700       269675(fs)  CN= 1 PRMRYTGT_BD     14689      H       12971      O  1.507
FRAM_#     18203400      4550850(fs)  CN= 1 PRMRYTGT_BD     14689      H       16172      O  1.507

Columns are separated by whitespace. How can I read data like the above using pandas, SciPy, or any other suitable module? In addition, the files may contain duplicated lines. If so, how can I filter out the duplicates? Any further suggestions would be highly appreciated.

Leon
  • It's not clear from your question how this should be parsed. By the looks of it, some columns are sometimes empty, but what's the mechanism? Are some or all fields fixed width? What's the expected result? – tripleee Feb 16 '20 at 17:44
  • Also, probably remove the question about duplicates; we really want one question per question, please. – tripleee Feb 16 '20 at 17:45
  • I am asking because I want to analyze my research data with ML modules such as pandas and scikit-learn. However, the data is not well formatted. I ran into this yesterday when I tried to import data whose lines may have different numbers of columns. I do not know how to import this "abnormal" format into pandas. Any suggestion would be highly appreciated. – Leon Feb 16 '20 at 17:50
  • Neither can we until you explain the expected result. (Maybe not even then.) – tripleee Feb 16 '20 at 18:13
  • Do you know the number of columns? ie the row with the most elements? – Onyambu Feb 16 '20 at 18:15
  • Possible duplicate of https://stackoverflow.com/questions/15242746/handling-variable-number-of-columns-with-pandas-python – tripleee Feb 16 '20 at 18:16
  • Possible duplicate of https://stackoverflow.com/questions/52882506/fixed-width-file-manipulation-in-pandas – tripleee Feb 16 '20 at 18:16
  • @Onyambu, the number of columns varies between files but can be found with an awk command. It usually ranges from 8 to 21 columns, and above 8 it increases in steps of three. For instance, in the data in my question, the first one has 8 columns, the second one has 14 (= 8 + 2×3), and the 5th line has 11 (= 8 + 1×3) columns. – Leon Feb 16 '20 at 18:30
  • @Leon You just need to know the maximum number of columns. You do not need to know the number of columns per row. So let's say the maximum is 21. Then you could read in the data using pandas as `pd.read_csv('your_data.dat', sep='\\s+', names=range(22))` – Onyambu Feb 16 '20 at 18:33
  • Hi, Onyambu, I sincerely appreciate your suggestion. It works. Thanks again. If you want, please put your comments in answer box so I can mark it as the best answer. Thanks again. – Leon Feb 17 '20 at 14:45
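The approach from the comments can be sketched roughly as follows. This is only a minimal sketch, assuming the fields are whitespace-separated and that "duplicated data" means rows identical in every column. The helper names `max_field_count` and `read_ragged` are made up for illustration, and the in-memory `sample` stands in for a real `*.dat` file opened with `open(...)`.

```python
import io

import pandas as pd

# A few lines in the style of the question, plus one repeated line
# (the last one) so that the duplicate filtering has something to do.
sample = """\
FRAM_# 0 0(fs) CN= 1 PRMRYTGT 14689 H 15449 O 1.008
FRAM_# 1100 275(fs) CN= 2 PRMRYTGT 14689 H 17402 O 1.257 15449 O 1.430
FRAM_# 303200 75800(fs) CN= 0 PRMRYTGT_BD 14689 H
FRAM_# 921200 230300(fs) CN= 1 PRMRYTGT_BD 14689 H 8375 O 1.062
FRAM_# 1078700 269675(fs) CN= 1 PRMRYTGT_BD 14689 H 12971 O 1.507
FRAM_# 18203400 4550850(fs) CN= 1 PRMRYTGT_BD 14689 H 16172 O 1.507
FRAM_# 921200 230300(fs) CN= 1 PRMRYTGT_BD 14689 H 8375 O 1.062
"""


def max_field_count(lines):
    """Return the largest number of whitespace-separated fields on any line."""
    return max((len(line.split()) for line in lines), default=0)


def read_ragged(handle):
    """Read a whitespace-separated file whose rows have varying lengths.

    Hypothetical helper: the first pass finds the widest row, the second
    pass lets pandas pad the shorter rows with NaN.
    """
    n_cols = max_field_count(handle)
    handle.seek(0)  # rewind so pandas reads from the start
    return pd.read_csv(handle, sep=r"\s+", header=None, names=range(n_cols))


df = read_ragged(io.StringIO(sample))

# Keep only the first occurrence of each fully identical row.
unique = df.drop_duplicates()

print(df.shape, unique.shape)
```

Scanning the file once for the widest row avoids having to guess the maximum column count in advance. If the maximum is already known (for example from awk), it could be passed to `names=range(...)` directly and the first pass skipped.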
