Extract numbers between string and second empty line : BASH and python

Question

This question is similar to many previous questions on SO, but seems distinct enough.

I have a data file with the following contents, from which the numbers are to be extracted. The number of elements in each number block varies, and there is one empty line above and below each number block. The aim is to extract the numbers and, if possible, assign them to Python NumPy arrays.

string 1 

234034 6361234 45096 12342134 2878814 456456
125294 7341234 17234 23135   768234  54134123
213203 6.25 2.36 1.0 0.0021 

string 2 

298034 20481234 45096 12502134 2870814 456456
19875294 441284 98234 27897135 251021524  768234  54134123
2.3261

string 3 

744034 6644034 75096 5302134 298978814 456456
6767294 70441234 330234 200135   867234  54004123
204203 22015 120158 125 21  625 11 5 2.021

Expected output : Numbers from all blocks arranged as bash arrays or numpy(python) arrays. Numeric values shown below are only representative.

Bash array : '744034','6644034','75096', .. .. '21','625','11','5','2.021'

or

Numpy array : [744034,6644034,75....,625,11,5,2.021]

My use case prefers numpy array though.

Taking a cue from a previous question, I tried `sed -n '/^symmetry 1$/,/^symmetry 2$/p' file`, but the output is empty, possibly because of the spaces in the start and end search terms.
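The trailing-space explanation can be checked directly. Below is a minimal sketch in Python, assuming the header line is `string 1 ` with one trailing space, as in the sample data; the variable names are illustrative only.

```python
import re

# Header line as it appears in the sample data: note the trailing space.
header = "string 1 "

# Anchored pattern with no tolerance for trailing whitespace: no match.
strict = re.search(r"^string 1$", header)

# Same pattern, but allowing optional blanks before the end of the line.
tolerant = re.search(r"^string 1[ \t]*$", header)

print(strict is None)        # True
print(tolerant is not None)  # True
```

If this is indeed the cause, the same `[ \t]*` tolerance would be needed in both sed address patterns.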

I then tried Python, since I eventually need the numbers as NumPy arrays. Based on the question and help in the comments, I get one block using the following code:

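import sys
import re

# Read the whole file named on the command line
with open(sys.argv[1]) as f:
    text = f.read()

# Lazily match everything from 'string 1' up to and including 'string 2'
reg = re.compile(r'string 1(.*?)string 2', re.DOTALL)
for match in reg.finditer(text):
    print(match.group())  # whole match, markers included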

output,

string 1 

744034 6644034 75096 5302134 298978814 456456
6767294 70441234 330234 200135   867234  54004123
204203 22015 120158 125 21  625 11 5 2.021

 string 2

Need suggestions.

`print match.groups()[0]` => `print(match.group())`, and the regex must be `r'xxx(.*?)yyy'` — Wiktor Stribiżew, Mar 31 '20 at 13:04
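The `r'xxx(.*?)yyy'` idea from the comment can be generalized so that one pattern picks up every block rather than only the one between two hard-coded markers. The following is a sketch only: it assumes every header looks like `string N` (optionally followed by blanks), and the shortened sample text and the names `pattern` and `blocks` are made up for illustration.

```python
import re
import numpy as np

# Shortened sample mirroring the layout in the question (headers carry a trailing space).
text = (
    "string 1 \n\n"
    "234034 6361234 45096\n"
    "213203 6.25 2.36 1.0 0.0021 \n\n"
    "string 2 \n\n"
    "298034 20481234\n"
    "2.3261\n\n"
    "string 3 \n\n"
    "744034 6644034\n"
    "11 5 2.021\n"
)

# Lazily capture everything after a header line, up to the next header or the end of the text.
pattern = re.compile(
    r"^string \d+[ \t]*\n(.*?)(?=^string \d+[ \t]*$|\Z)",
    re.DOTALL | re.MULTILINE,
)

# One float array per captured block; split() discards newlines and repeated spaces.
blocks = [
    np.array([float(tok) for tok in m.group(1).split()])
    for m in pattern.finditer(text)
]

for block in blocks:
    print(block)
```

The lookahead keeps the next header out of the match, so consecutive blocks are found without overlapping.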

score 1 · Accepted Answer · answered Mar 31 '20 at 13:22

If I understood well, this could help:

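>>> import numpy as np
>>> # file_content is assumed to hold the whole file as a single string
>>> [np.array(block.split()).astype(float)  # good blocks get parsed into np arrays
...  for block in file_content.split("\n\n")  # split by empty lines
...  if not block[0].isalpha()]  # avoid string lines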

[array([2.3403400e+05, 6.3612340e+06, 4.5096000e+04, 1.2342134e+07,
        2.8788140e+06, 4.5645600e+05, 1.2529400e+05, 7.3412340e+06,
        1.7234000e+04, 2.3135000e+04, 7.6823400e+05, 5.4134123e+07,
        2.1320300e+05, 6.2500000e+00, 2.3600000e+00, 1.0000000e+00,
        2.1000000e-03]),
 array([2.98034000e+05, 2.04812340e+07, 4.50960000e+04, 1.25021340e+07,
        2.87081400e+06, 4.56456000e+05, 1.98752940e+07, 4.41284000e+05,
        9.82340000e+04, 2.78971350e+07, 2.51021524e+08, 7.68234000e+05,
        5.41341230e+07, 2.32610000e+00]),
 array([7.44034000e+05, 6.64403400e+06, 7.50960000e+04, 5.30213400e+06,
        2.98978814e+08, 4.56456000e+05, 6.76729400e+06, 7.04412340e+07,
        3.30234000e+05, 2.00135000e+05, 8.67234000e+05, 5.40041230e+07,
        2.04203000e+05, 2.20150000e+04, 1.20158000e+05, 1.25000000e+02,
        2.10000000e+01, 6.25000000e+02, 1.10000000e+01, 5.00000000e+00,
        2.02100000e+00])]

answered Mar 31 '20 at 13:22

arnaud

Seems ok. Will try out. – ankit7540 Mar 31 '20 at 13:25
Maybe the filter to avoid string lines is not strong enough, as I only check the first character, and a string line could perhaps start with a number. Tell me if that's a problem. – arnaud Mar 31 '20 at 13:28
@ankit7540 if the solution worked for you, you may accept the answer. Thanks – arnaud Mar 31 '20 at 13:44
My file is quite big and I am in the process of extracting the part which has relevant data. Might take some time. Sorry for delay. – ankit7540 Mar 31 '20 at 14:32
The filter to avoid string does need more, I found that the String sometimes has `-` dash in it. `isalpha()` is not catching this character I guess. – ankit7540 Mar 31 '20 at 15:14
after stripping the strings for `-` got it to work. – ankit7540 Mar 31 '20 at 16:30
You could also use `if not any(c.isalpha() for c in block[:3])` to check the first 3 characters... Great if it worked! – arnaud Mar 31 '20 at 16:53
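The comments above circle around how to tell header blocks from number blocks. One alternative, not taken from the thread, is to avoid character checks altogether and keep a block only if every token converts to `float`. A minimal sketch follows; the function name `parse_numeric_blocks` and the sample text are hypothetical.

```python
import numpy as np

def parse_numeric_blocks(file_content):
    """Return one float array per blank-line-separated block that is fully numeric."""
    arrays = []
    for block in file_content.split("\n\n"):
        tokens = block.split()
        if not tokens:        # skip empty blocks, e.g. from trailing blank lines
            continue
        try:
            values = [float(tok) for tok in tokens]
        except ValueError:    # at least one token is not a number: treat as a header
            continue
        arrays.append(np.array(values))
    return arrays

# Headers containing a dash or starting with a digit are both rejected.
sample = "string-1 \n\n1 2 3\n4.5 \n\n2nd string \n\n6 7\n"
for arr in parse_numeric_blocks(sample):
    print(arr)
```

This would still accept a header that consists solely of numbers (or of tokens such as `nan` or `inf`, which `float` also parses), so it is a heuristic rather than a guarantee.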

score 1 · Answer 2 · answered Mar 31 '20 at 13:57

You don't show your expected output, but is this what you're trying to do?

$ awk -v RS= '!(NR%2)' file
234034 6361234 45096 12342134 2878814 456456
125294 7341234 17234 23135   768234  54134123
213203 6.25 2.36 1.0 0.0021
298034 20481234 45096 12502134 2870814 456456
19875294 441284 98234 27897135 251021524  768234  54134123
2.3261
744034 6644034 75096 5302134 298978814 456456
6767294 70441234 330234 200135   867234  54004123
204203 22015 120158 125 21  625 11 5 2.021

or maybe one of these (or something else - do tell....):

$ awk -v RS= -v ORS='\n\n' '!(NR%2)' file
234034 6361234 45096 12342134 2878814 456456
125294 7341234 17234 23135   768234  54134123
213203 6.25 2.36 1.0 0.0021

298034 20481234 45096 12502134 2870814 456456
19875294 441284 98234 27897135 251021524  768234  54134123
2.3261

744034 6644034 75096 5302134 298978814 456456
6767294 70441234 330234 200135   867234  54004123
204203 22015 120158 125 21  625 11 5 2.021

.

$ awk -v RS= -v OFS='\n' '!(NR%2){$1=$1; print}' file
234034
6361234
45096
12342134
2878814
456456
125294
7341234
17234
23135
768234
54134123
213203
6.25
2.36
1.0
0.0021
298034
20481234
45096
12502134
2870814
456456
19875294
441284
98234
27897135
251021524
768234
54134123
2.3261
744034
6644034
75096
5302134
298978814
456456
6767294
70441234
330234
200135
867234
54004123
204203
22015
120158
125
21
625
11
5
2.021
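For readers who want the same logic inside Python, the awk commands above can be mirrored roughly as follows. This is a sketch under the same assumption the awk scripts make, namely that header and number blocks strictly alternate; the sample text and variable names are illustrative.

```python
import re

text = "string 1 \n\n1 2 3\n4 5\n\nstring 2 \n\n6 7\n8.5\n"

# Like awk's paragraph mode (RS=""): records are separated by runs of empty lines.
records = re.split(r"\n{2,}", text.strip("\n"))

# Like '!(NR%2)': keep records 2, 4, 6, ... (indices 1, 3, 5, ... in Python).
number_records = records[1::2]

# Like the third awk command: print one value per line.
for record in number_records:
    for token in record.split():
        print(token)
```

Each entry of `number_records` is still a string at this point; converting it to numbers is a separate step.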

answered Mar 31 '20 at 13:57

Ed Morton

I am trying to have the numbers from different blocks as arrays. Since the number of elements in each block is not uniform, the awk approach is not well suited. – ankit7540 Mar 31 '20 at 14:26
I don't know what that means. If you [edit] your question to show the expected output, whatever it is, I can show you how to get that output using awk. I see something under the heading "output" in your question but I don't see how the values there correlate to the values in the input you provided (where did `-20.73386803` come from, for example?) and I see "string1" appearing there when you said you just want to get the numbers so it's not clear to me that that is the actual desired output given your posted sample input or, if it is, how it's being mapped from one to the other. – Ed Morton Mar 31 '20 at 14:39
Thanks for the comment. Sorry my bad. The output file has several 10s of such blocks and I showed one such block's output. Will fix it. Desired output has been modified. – ankit7540 Mar 31 '20 at 14:53
So your expected output under "string1" is the block of numbers that are under "string3" in the input unchanged? I honestly have no idea what it is you're trying to do, sorry. – Ed Morton Mar 31 '20 at 14:57
sorry for bad communication on my side, expected output is all number blocks as arrays. – ankit7540 Mar 31 '20 at 14:59
Change `'!(NR%2)'` to `!(NR%2)[a[++c]=$0}'` in the first script I posted and you have the data in an array. Now what? – Ed Morton Mar 31 '20 at 15:02
Command given, `awk -v RS= '!(NR%2)[a[++c]=$0}' file` output : `syntax error at or near [` and `extra }` – ankit7540 Mar 31 '20 at 15:13
Should have been `{a`, not `[a`. The main point though is that storing data in an array doesn't DO anything. Somewhere, somehow you have to produce actual output in some format - without that, whether the data is stored in an array or not is pretty meaningless as it's just sitting in memory inside some program. There's absolutely nothing in your question so far to indicate what you want to DO with the data (concatenate it, multiply it, write it to some file, something else), just that you think storing it in an array would be a useful starting point for doing whatever that is. – Ed Morton Mar 31 '20 at 15:32
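As one possible answer to the "now what?" raised here, blocks of unequal length can be joined into the single flat NumPy array that the question lists as expected output. A minimal sketch, assuming the per-block arrays already exist (the values below are a made-up subset of the sample data):

```python
import numpy as np

# Hypothetical per-block arrays of unequal length.
blocks = [
    np.array([234034.0, 6361234.0, 0.0021]),
    np.array([298034.0, 2.3261]),
    np.array([744034.0, 6644034.0, 5.0, 2.021]),
]

# Ragged data cannot form a regular 2-D array, but it can be flattened into one 1-D array.
flat = np.concatenate(blocks)

# Keeping the block lengths allows the flat array to be split back into blocks later.
lengths = [len(b) for b in blocks]
restored = np.split(flat, np.cumsum(lengths)[:-1])

print(flat.shape)  # (9,)
print(lengths)     # [3, 2, 4]
```

Whether a flat array or a list of per-block arrays is more useful depends on what is done with the data afterwards, which the thread leaves open.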
