
I have a matrix written in this format inside a log file:

2014-09-08 14:10:20,107 - root - INFO - [[  8.30857546   0.69993454   0.20645551  
77.01797674  13.76705776]
 [  8.35205432   0.53417203   0.19969048  76.78598173  14.12810144]
 [  8.37066492   0.64428449   0.18623849  76.4181809   14.3806312 ]
 [  8.50493296   0.5110043    0.19731849  76.45838604  14.32835821]
 [  8.18900791   0.4955451    0.22524777  76.96966663  14.12053259]]
...some text 
2014-09-08 14:12:22,211 - root - INFO - [[  3.25142253e+01   1.11788106e+00   1.51065008e-02   6.16496299e+01
    4.70315726e+00]
 [  3.31685887e+01   9.53522041e-01   1.49767860e-02   6.13449154e+01
    4.51799710e+00]
 [  3.31101827e+01   1.09729703e+00   5.03347259e-03   6.11818594e+01
    4.60562742e+00]
 [  3.32506957e+01   1.13837592e+00   1.51783456e-02   6.08651657e+01
    4.73058437e+00]
 [  3.26809490e+01   1.06617279e+00   1.00110121e-02   6.17429172e+01
    4.49994994e+00]]

I am writing this matrix using the python logging package:

logging.info(conf_mat)

However, I could not find a way to make logging.info write the matrix in a float %.3f format. So I decided to parse the log file this way:

conf_mat = [[]]
cf = '[+-]?(?=\d*[.eE])(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?'

with open(sys.argv[1]) as f:
    for line in f:
        epoch = re.findall(ep, line) # find lines starting with epoch for other stuff
        if epoch:
            error_line = next(f) # grab the next line, which is the error line
            error_value = error_line[error_line.rfind('=')+1:]
            data_points.append(map(float,epoch[0]+(error_value,))) #get the error value for the specific epoch
            for i in range(N):
                cnf_mline = next(f)
                match = re.findall(cf, cnf_mline)
                if match:
                    conf_mat[count].append(map(float,match))
                else:
                    conf_mat.append([])
                    count += 1

However, the regex does not catch the break in the line when looking at the matrix, when I try to convert the matrix using

conf_mtx = np.array(conf_mat)
cuda_hpc80
  • Is the log file produced by an application you control? If so, it would be better to write it in a format that is easier to read back, e.g. using a common comment string and removing the ``[[``, ``[``, ``]``, ``]]``... – Saullo G. P. Castro Sep 18 '14 at 06:05

1 Answer


Your regex string cf should be a raw string literal:

cf = r'[+-]?(?=\d*[.eE])(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?'

Backslash \ characters are interpreted as escape sequences in "regular" strings, which is not what you want in a regex. This particular pattern happens to survive as a regular string, because \d and \. are not recognized string escapes and Python leaves them in place, but newer Python versions warn about that, and a sequence like \b would silently become a backspace character. You can read about raw string literals at the top of the re module's documentation, and in this excellent SO answer. Alex Martelli explains them quite well, so I won't repeat everything he says here. Suffice it to say that without a raw literal you'd have to escape each and every one of your backslashes with another backslash, and that gets ugly and annoying fast.
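
To make the difference concrete, here is a small sketch. The `plain` and `raw` names and the `'a cat sat'` test string are made up for illustration; only `cf` comes from the question.

```python
import re

# Raw literal: every backslash reaches the regex engine untouched.
cf = r'[+-]?(?=\d*[.eE])(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?'

# '\b' is a backspace character in a regular string,
# but the regex "word boundary" token in a raw string.
plain = '\bcat\b'
raw = r'\bcat\b'

print(len(plain), len(raw))             # the raw version keeps both backslashes
print(re.findall(raw, 'a cat sat'))     # word-boundary match
print(re.findall(plain, 'a cat sat'))   # looks for literal backspaces instead
print(re.findall(cf, ' [  8.35205432   5.03347259e-03]'))
```

The pattern only picks up tokens containing a decimal point or an exponent, so plain integers such as the ones in the timestamp are left alone.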

As for the rest of your code, it won't run without more information. The N in for i in range(N): is undefined, as is count a few lines later. Calling cnf_mline = next(f) inside the loop is risky: it pulls lines from the same iterator that for line in f: is reading, so every line taken by next() is skipped by the outer loop, and if N overshoots the end of the file you get a StopIteration. It's unclear whether your data really has that line break in the second half, where one of the members of the list is on the next line. I assume that's the case because of the next attempt.
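
Here is a tiny illustration of how next() and the for loop draw from the same iterator. The io.StringIO object is just a stand-in for an open log file, and the list names are made up for the demo.

```python
import io

f = io.StringIO("a\nb\nc\nd\n")  # stand-in for open(sys.argv[1])

seen_by_loop = []
taken_by_next = []
for line in f:
    seen_by_loop.append(line.strip())
    # The default argument avoids StopIteration at the end of the file.
    nxt = next(f, None)
    if nxt is not None:
        taken_by_next.append(nxt.strip())

print(seen_by_loop)   # the for loop never sees the lines next() consumed
print(taken_by_next)
```

Each line is handed out exactly once, to whichever of the two asks first.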

I think you should first try to clean up your input file into a regular format, then you'll have a much easier time running regular expressions on it. In order to work on subsequent lines without exhausting the file iterator through repeated next() calls, check out itertools.tee(). It returns n independent iterators from a single iterable, allowing you to advance the second a line ahead of the first. Alternatively, you could read your file's lines into a list and operate on indices i and i+1. Either way, strip each line, join the pieces together, and write the result to a new file or list. You can then rewrite your matching loop to simply pull each number of the appropriate format out and insert it into your matrix at the correct position. The good news is your regex caught everything I threw at it, so you won't need to modify anything there.
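
A rough sketch of the strip-and-join idea, assuming every matrix in the log is wrapped in [[ ... ]] as in your sample. The parse_matrices name and the sample list are mine, not from your code, and rows are split on the brackets rather than on newlines, so a wrapped row stays in one piece.

```python
import re

cf = r'[+-]?(?=\d*[.eE])(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?'

def parse_matrices(lines):
    """Return one list of float rows per [[...]] block found in lines."""
    # Strip each line and join, so a row split over two lines becomes contiguous.
    text = ' '.join(line.strip() for line in lines)
    matrices = []
    # A block runs from '[[' to the next ']]'.
    for block in re.findall(r'\[\[(.*?)\]\]', text):
        # Inside a block, rows are separated by '] ['.
        rows = [[float(x) for x in re.findall(cf, row)]
                for row in re.split(r'\]\s*\[', block)]
        matrices.append(rows)
    return matrices

# Shortened sample in the same shape as the second matrix in the log.
sample = [
    "2014-09-08 14:12:22,211 - root - INFO - [[  3.25142253e+01   1.11788106e+00",
    "    4.70315726e+00]",
    " [  3.31685887e+01   9.53522041e-01",
    "    4.51799710e+00]]",
    "...some text",
]
mats = parse_matrices(sample)
print(mats)
```

With a real file you would pass the open file object (or f.readlines()) instead of sample. Each entry of the result is rectangular, so np.array(mats[k]) should give a proper 2-D array, which you can then print with whatever float format you like.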

Good luck!

MattDMo
  • conf_mat prints `[[[8.30857546, 0.69993454, 0.20645551, 77.01797674, 13.76705776], [8.35205432, 0.53417203, 0.19969048, 76.78598173, 14.12810144], [8.37066492, 0.64428449, 0.18623849, 76.4181809, 14.3806312], [8.50493296, 0.5110043, 0.19731849, 76.45838604, 14.32835821], [8.18900791, 0.4955451, 0.22524777, 76.96966663, 14.12053259], [32.5142253, 1.11788106, 0.0151065008, 61.6496299], [4.70315726], [33.1685887, 0.953522041, 0.014976786, 61.3449154], [4.5179971], [33.1101827, 1.09729703, 0.00503347259, 61.1818594]]]` so the regex does not get the line break – cuda_hpc80 Sep 18 '14 at 09:43