I need to read some data from several text files that have a random number of lines of text at the beginning. Typically the files look like:

file1.dat:

The file contains data
# this is a comment skip me
DataStart
  index = integer
Some text

 -5.0e-2 3.3 4.0
 0 0.0e0 0.0e0
 1.0 0.1 3.0
 1.5 4.0 1.87
 1.7 -4.67 0.124
 ...
 ...
 15.3 -3.5e02 1.775
  • At the beginning of file1.dat it may contain several lines of text that could start with spaces, tabs, etc.
  • The block of data I am interested in is always below those lines and has a fixed number of columns, in this case, it has 3 columns:
 -5.0e-2 3.3 4.0
 0 0.0e0 0.0e0
 1.0 0.1 3.0
 1.5 4.0 1.87
 1.7 -4.67 0.124
 ...
 ...
 15.3 -3.5e02 1.775

The lines containing the data may have spaces/tabs at the start of each line.

I have tried the following code:

import numpy as np

pattern = r'^[-0-9 ]*' 
mydata = np.fromregex('file1.dat', pattern, dtype=float)

But when I run it I get:

~/.local/lib/python3.8/site-packages/numpy/lib/npyio.py in fromregex(file, regexp, dtype, encoding)
   1530             # Create the new array as a single data-type and then
   1531             #   re-interpret as a single-field structured array.
-> 1532             newdtype = np.dtype(dtype[dtype.names[0]])
   1533             output = np.array(seq, dtype=newdtype)
   1534             output.dtype = dtype

TypeError: 'NoneType' object is not subscriptable

Your help is very much appreciated

Amazigh_05

3 Answers

I think your regex needs capture groups, one per column, so it should look more like this:

pattern = r'\s*([-+0-9e.]+)\s+([-+0-9e.]+)\s+([-+0-9e.]+).*'
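As a rough sketch of how a pattern like this could be used, the example below anchors each match to a single line with the inline `(?m)` flag and uses `[ \t]` instead of `\s`, so that a match cannot span a newline. The sample text, the temporary file, and the `'f8,f8,f8'` structured dtype are assumptions made for illustration. A structured dtype is used because `np.fromregex` is documented as requiring one. Note that the loose character class would also accept tokens such as `...`, so this only works when the header lines do not happen to look like three such tokens.

```python
import os
import tempfile

import numpy as np

# Sample content modelled on file1.dat from the question (assumed layout).
sample = """The file contains data
# this is a comment skip me
DataStart
  index = integer
Some text

 -5.0e-2 3.3 4.0
 0 0.0e0 0.0e0
 1.0 0.1 3.0
 1.5 4.0 1.87
 1.7 -4.67 0.124
 15.3 -3.5e02 1.775
"""

# One loosely defined numeric token, captured as a group.
token = r'([-+0-9eE.]+)'

# (?m) makes ^ and $ match at line boundaries; [ \t] never crosses a newline.
pattern = (r'(?m)^[ \t]*' + token + r'[ \t]+' + token
           + r'[ \t]+' + token + r'[ \t]*$')

# Write the sample to a temporary file so np.fromregex has something to read.
with tempfile.NamedTemporaryFile('w', suffix='.dat', delete=False) as fh:
    fh.write(sample)
    path = fh.name

try:
    # One field per capture group: f0, f1, f2.
    rows = np.fromregex(path, pattern, dtype=np.dtype('f8,f8,f8'))
finally:
    os.remove(path)

print(rows['f0'])  # first column
print(rows['f1'])  # second column
```

The result is a one-dimensional structured array with one record per matched line; each column is reached by field name.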
Mark Setchell

To match a floating-point number, we can use the following regex (see this answer for details):

[+\-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?\d+)?

You need to add that inside a group () to extract the tokens from each line:

import numpy as np
from numpy.lib import recfunctions as rfn

# zero or more white spaces
opt_whitespace = r'\s*'

# The number token
number = r'([+\-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?\d+)?)'

# one or more whitespaces
whitespace = r'\s+'

# Number of data columns
N = 3

# The regex 
pattern = opt_whitespace + number + (whitespace + number)*(N-1) + opt_whitespace + r'\n'

data_type = np.dtype(','.join(['f8']*N)) # f8 means 64-bit floating-point number

data = np.fromregex('file1.dat', pattern, dtype=data_type)

data = rfn.structured_to_unstructured(data)
print(data)

Output:

[[-5.000e-02  3.300e+00  4.000e+00]
 [ 0.000e+00  0.000e+00  0.000e+00]
 [ 1.000e+00  1.000e-01  3.000e+00]
 [ 1.500e+00  4.000e+00  1.870e+00]
 [ 1.700e+00 -4.670e+00  1.240e-01]
 [ 1.530e+01 -3.500e+02  1.775e+00]]
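To get a feel for what the floating-point regex above accepts, here is a small check using `re.fullmatch` on a few tokens. The token list is made up for illustration: the first five come from the sample data, the rest are extra cases. Note that the regex is stricter than Python's `float()`, which would also accept `.5`, `1.` and `007`.

```python
import re

# The floating-point regex from the answer, without the capture group.
FLOAT = r'[+\-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?\d+)?'
float_re = re.compile(FLOAT)

# Tokens to try: values from the sample file plus a few edge cases.
tokens = ['-5.0e-2', '0', '0.0e0', '15.3', '-3.5e02',
          'DataStart', '.5', '1.', '007']

# fullmatch requires the whole token to match, not just a prefix.
results = {tok: float_re.fullmatch(tok) is not None for tok in tokens}

for tok, ok in results.items():
    print(f'{tok!r:12} {ok}')
```

If the data files may contain values written as `.5` or `1.`, the integer and fraction parts of the regex would need to be relaxed accordingly.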

s.ouchene

In [603]: txt="""-5.0e-2 3.3 4.0
     ...:  0 0.0e0 0.0e0
     ...:  1.0 0.1 3.0
     ...:  1.5 4.0 1.87
     ...:  1.7 -4.67 0.124
     ...:  15.3 -3.5e02 1.775"""

The number layout looks regular enough for the standard CSV reader, np.genfromtxt:

In [604]: np.genfromtxt(txt.splitlines())
Out[604]: 
array([[-5.000e-02,  3.300e+00,  4.000e+00],
       [ 0.000e+00,  0.000e+00,  0.000e+00],
       [ 1.000e+00,  1.000e-01,  3.000e+00],
       [ 1.500e+00,  4.000e+00,  1.870e+00],
       [ 1.700e+00, -4.670e+00,  1.240e-01],
       [ 1.530e+01, -3.500e+02,  1.775e+00]])

Or even a plain line split:

In [605]: alist=[]
     ...: for line in txt.splitlines():
     ...:     alist.append(line.split())
     ...: 
In [606]: alist
Out[606]: 
[['-5.0e-2', '3.3', '4.0'],
 ['0', '0.0e0', '0.0e0'],
 ['1.0', '0.1', '3.0'],
 ['1.5', '4.0', '1.87'],
 ['1.7', '-4.67', '0.124'],
 ['15.3', '-3.5e02', '1.775']]
In [607]: np.array(alist, float)
Out[607]: 
array([[-5.000e-02,  3.300e+00,  4.000e+00],
       [ 0.000e+00,  0.000e+00,  0.000e+00],
       [ 1.000e+00,  1.000e-01,  3.000e+00],
       [ 1.500e+00,  4.000e+00,  1.870e+00],
       [ 1.700e+00, -4.670e+00,  1.240e-01],
       [ 1.530e+01, -3.500e+02,  1.775e+00]])
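The snippets above start from the numeric block only. One way the line-split idea might be extended to skip the leading text is sketched below: keep only the lines that split into exactly the expected number of tokens, all of which parse as floats. The helper name `read_numeric_rows` and the sample text are hypothetical, and the sketch assumes no header line happens to consist of exactly three numbers.

```python
import numpy as np


def read_numeric_rows(lines, ncols):
    """Keep only lines that split into exactly `ncols` float-parsable tokens."""
    rows = []
    for line in lines:
        parts = line.split()  # split() also drops leading spaces and tabs
        if len(parts) != ncols:
            continue  # wrong number of tokens: header, comment, or blank line
        try:
            rows.append([float(p) for p in parts])
        except ValueError:
            continue  # right token count, but not all numeric
    # reshape keeps a (0, ncols) shape even when nothing matched
    return np.array(rows, dtype=float).reshape(-1, ncols)


# Sample content modelled on file1.dat from the question (assumed layout).
sample = """The file contains data
# this is a comment skip me
DataStart
  index = integer
Some text

 -5.0e-2 3.3 4.0
 0 0.0e0 0.0e0
 1.0 0.1 3.0
 1.5 4.0 1.87
 1.7 -4.67 0.124
 15.3 -3.5e02 1.775
"""

data = read_numeric_rows(sample.splitlines(), ncols=3)
print(data.shape)
print(data)
```

For a real file, the open file object can be passed directly, since iterating over it yields lines: `with open('file1.dat') as fh: data = read_numeric_rows(fh, 3)`.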
hpaulj