Parse a file for all occurrences of a string and generate key-values in JSON

Question

I have a file (https://pastebin.com/STgtBRS8) in which I need to search for all the occurrences of the word "silencedetect".
I then have to generate a JSON file that contains the key-values of “silence_start”, “silence_end”, and “silence_duration”.

The JSON file should look something like this:

[
{
"id": 1,
"silence_start": -0.012381,
"silence_end": 2.2059,
"silence_duration": 2.21828
},
{
"id": 2,
"silence_start": 5.79261,
"silence_end": 6.91955,
"silence_duration": 1.12694
}
]

This is what I have tried:

with open('volume_data.csv', 'r') as myfile:
    data = myfile.read().replace('\n', '')

for line in data:
    if "silencedetect" in data:
        #read silence_start, silence_end, and silence_duration and put in json

I am unable to associate the three key-value pairs with each "silencedetect". How can I parse the key-values and get them in JSON format?

@RomanPerekhrest: Yes but I have considered it as one. It could be .txt as well. Ignore the extension for now. — pikaraider, Jul 05 '17 at 08:35

Stael · Accepted Answer · 2017-07-05T09:17:56.850

You can use a regex for this. The following works for me:

import re

with open('volume_data.csv', 'r') as myfile:
    data = myfile.read()

# raw string so the regex escapes (\d, \w, \[) are passed through unchanged
d = re.findall(r'silence_start: (-?\d+\.\d+)\n.*?\n?\[silencedetect @ \w{14}\] silence_end: (-?\d+\.\d+) \| silence_duration: (-?\d+\.\d+)', data)
print(d)

You could put them into JSON like this:

import json

# each match is a (start, end, duration) tuple of strings
out = [{'id': i, 'start':a[0], 'end':a[1], 'duration':a[2]} for i, a in enumerate(d)]
print(json.dumps(out)) # or write to file or... whatever

output:

'[{"duration": "2.21828", "start": "-0.012381", "end": "2.2059", "id": 0}, {"duration": "1.12694", "start": "5.79261", "end": "6.91955", "id": 1}, {"duration": "0.59288", "start": "8.53256", "end": "9.12544", "id": 2}, {"duration": "1.0805", "start": "9.64712", "end": "10.7276", "id": 3}, {"duration": "1.03406", "start": "12.6657", "end": "13.6998", "id": 4}, {"duration": "0.871519", "start": "19.2602", "end": "20.1317", "id": 5}'

EDIT: fixed a bug that missed some matches because the frame=.. line fell between the start and end of the match
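If you want numbers rather than strings, and the key names from the question, the same regex can be wrapped like this. This is only a sketch: `SAMPLE`, `parse_silences` and `write_json` are names made up for illustration, and the sample lines are assembled from values quoted in this thread rather than read from the real file.

```python
import json
import re

# Hypothetical sample built from values quoted in this thread; the real input
# would be the contents of the file from the question.
SAMPLE = (
    "[silencedetect @ 0x7fe7a4f00ac0] silence_start: -0.012381\n"
    "[silencedetect @ 0x7fe7a4f00ac0] silence_end: 2.2059 | silence_duration: 2.21828\n"
    "[silencedetect @ 0x7fe7a4f00ac0] silence_start: 732.925\n"
    "frame=22123 fps=1001 q=-0.0 size=N/A time=00:12:18.17 bitrate=N/A speed=33.4x\n"
    "[silencedetect @ 0x7fe7a4f00ac0] silence_end: 738.673 | silence_duration: 5.74771\n"
)

# Same pattern as in the answer, split over several raw strings for readability.
PATTERN = re.compile(
    r"silence_start: (-?\d+\.\d+)\n.*?\n?"
    r"\[silencedetect @ \w{14}\] silence_end: (-?\d+\.\d+)"
    r" \| silence_duration: (-?\d+\.\d+)"
)


def parse_silences(text):
    """Return one dict per matched start/end pair, with ids counting from 1."""
    return [
        {
            "id": i,
            "silence_start": float(start),
            "silence_end": float(end),
            "silence_duration": float(duration),
        }
        for i, (start, end, duration) in enumerate(PATTERN.findall(text), start=1)
    ]


def write_json(entries, path):
    """Write the parsed entries to a JSON file, pretty-printed."""
    with open(path, "w") as out:
        json.dump(entries, out, indent=4)


entries = parse_silences(SAMPLE)
print(json.dumps(entries, indent=4))
```

Calling `write_json(entries, "silences.json")` (the filename is just an example) would then give you the file instead of printing it.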

score 1 · Answer 2 · answered Jul 05 '17 at 08:54

Complex solution using re.findall and enumerate functions:

import re, json

with open('volume_data.txt', 'r') as f:
    result = []
    pat = re.compile(r'(silence_start: -?\d+\.\d+).+?(silence_end: -?\d+\.\d+).+?(silence_duration: -?\d+\.\d+)')
    silence_items = re.findall(pat, f.read().replace('\n', ''))
    for i,v in enumerate(silence_items):
        d = {'id': i+1}
        d.update({pair[:pair.find(':')]: float(pair[pair.find(':')+2:]) for pair in v})
        result.append(d)

    print(json.dumps(result, indent=4))

The output:

[
    {
        "id": 1,
        "silence_end": 2.2059,
        "silence_duration": 2.21828,
        "silence_start": -0.012381
    },
    {
        "id": 2,
        "silence_end": 6.91955,
        "silence_duration": 1.12694,
        "silence_start": 5.79261
    },
    {
        "id": 3,
        "silence_end": 9.12544,
        "silence_duration": 0.59288,
        "silence_start": 8.53256
    },
    {
        "id": 4,
        "silence_end": 10.7276,
        "silence_duration": 1.0805,
        "silence_start": 9.64712
    },
    {
        "id": 5,
        "silence_end": 13.6998,
        "silence_duration": 1.03406,
        "silence_start": 12.6657
    },
    {
        "id": 6,
        "silence_end": 20.1317,
        "silence_duration": 0.871519,
        "silence_start": 19.2602
    },
    {
        "id": 7,
        "silence_end": 22.4305,
        "silence_duration": 0.801859,
        "silence_start": 21.6286
    },
    ...
]

i didn't know about `indent=4`, that's cool. out of interest, how many records do you find? — Stael, Jul 05 '17 at 08:57
i missed 4 because of the frame=... line falling in the middle — Stael, Jul 05 '17 at 09:03
@RomanPerekhrest, thanks for such a concise solution. I want to know, does JSON preserve the key-value ordering. For example, I am getting each element in the following order: { "silence_end": 596.869, "silence_duration": 0.825079, "id": 139, "silence_start": 596.044 } Can I get it in the order: ID, silence_start, silence_end, silence_duration ? — pikaraider, Jul 05 '17 at 09:13
@Mahesh you shouldn't need to do that, see: https://stackoverflow.com/questions/4515676/keep-the-order-of-the-json-keys-during-json-conversion-to-csv — Stael, Jul 05 '17 at 09:14
@Stael, thanks for the information. I now see the point in JSON entries being unordered. — pikaraider, Jul 05 '17 at 09:28

zwer · Answer 3 · 2017-07-05T09:26:49.113

Assuming your data is ordered, you can simply stream-parse it, no need for regex and loading of the whole file at all:

import json

parsed = []  # a list to hold our parsed values
with open("entries.dat", "r") as f:  # open the file for reading
    current_id = 1  # holds our ID
    entry = None  # holds the current parsed entry
    for line in f:  # ... go through the file line by line
        if line[:14] == "[silencedetect":  # parse the lines starting with [silencedetect
            if entry:  # we already picked up silence_start
                index = line.find("silence_end:")  # find where silence_end starts
                value = line[index + 12:line.find("|", index)].strip()  # the number after it
                entry["silence_end"] = float(value)  # store the silence_end
                # the following step is optional, instead of parsing you can just calculate
                # the silence_duration yourself with:
                # entry["silence_duration"] = entry["silence_end"] - entry["silence_start"]
                index = line.find("silence_duration:")  # find where silence_duration starts
                value = line[index + 17:].strip()  # grab the number after it
                entry["silence_duration"] = float(value)  # store the silence_duration
                # and now that we have everything...
                parsed.append(entry)  # add the entry to our parsed list
                entry = None  # blank out the entry for the next step
            else:  # find silence_start first
                index = line.find("silence_start:")  # find where silence_start, well, starts
                value = line[index + 14:].strip()  # grab the number after it
                entry = {"id": current_id}  # store the current ID...
                entry["silence_start"] = float(value)  # ... and the silence_start
                current_id += 1  # increase our ID value for the next entry

# Now that we have our data, we can easily turn it into JSON and print it out if needed
your_json = json.dumps(parsed, indent=4)  # holds the JSON, pretty-printed
print(your_json)  # let's print it...

And you get:

[
    {
        "silence_end": 2.2059, 
        "silence_duration": 2.21828, 
        "id": 1, 
        "silence_start": -0.012381
    }, 
    {
        "silence_end": 6.91955, 
        "silence_duration": 1.12694, 
        "id": 2, 
        "silence_start": 5.79261
    }, 
    {
        "silence_end": 9.12544, 
        "silence_duration": 0.59288, 
        "id": 3, 
        "silence_start": 8.53256
    }, 
    {
        "silence_end": 10.7276, 
        "silence_duration": 1.0805, 
        "id": 4, 
        "silence_start": 9.64712
    }, 
    # 
    # etc.
    # 
    {
        "silence_end": 795.516, 
        "silence_duration": 0.68576, 
        "id": 189, 
        "silence_start": 794.83
    }
]

Keep in mind that JSON doesn't prescribe an order for object members (nor do Python dicts before 3.7; CPython 3.6 keeps insertion order only as an implementation detail), so the id won't necessarily appear first, but the data is just as valid.

I've purposefully separated the initial entry creation so you can use collections.OrderedDict as a drop-in replacement (i.e. entry = collections.OrderedDict({"id": current_id})) to preserve the order if that's what you wish.
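A minimal sketch of that OrderedDict swap, using the values of the first entry. The variable names `entry`, `encoded` and `decoded` are only for illustration:

```python
import collections
import json

# Build the entry in the order the keys should appear in the output.
entry = collections.OrderedDict({"id": 1})
entry["silence_start"] = -0.012381
entry["silence_end"] = 2.2059
entry["silence_duration"] = 2.21828

# json.dumps writes the keys in the dict's iteration order
# (as long as sort_keys is left at its default of False).
encoded = json.dumps(entry)
print(encoded)
# {"id": 1, "silence_start": -0.012381, "silence_end": 2.2059, "silence_duration": 2.21828}

# Reading it back with object_pairs_hook keeps the order on older Pythons too.
decoded = json.loads(encoded, object_pairs_hook=collections.OrderedDict)
print(list(decoded.keys()))
```

Any other JSON consumer is still free to ignore the order, so this only helps readability of the output.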

Ando Jurai · Answer 4 · 2017-07-05T09:36:33.957

import re
import json

with open('volume_data.csv', 'r') as myfile:
    data = myfile.read()

# the final \d+ of each number is greedy; a lazy \d+? there would capture only one decimal digit
matcher = re.compile(r'(?P<g1>\[silencedetect @ \w+?\])\s+?silence_start:\s+?(?P<g2>-?\d+?\.\d+).*?\n([^\[]+?\n)?(?P=g1)\s+?silence_end:\s+?(?P<g3>-?\d+?\.\d+).+?\|\s+?silence_duration:\s+?(?P<g4>-?\d+?\.\d+).*?\n')
matchiter = matcher.finditer(data)  # yields match objects, so .group() is available
#(1) (2)
entries = []
for i, matched in enumerate(matchiter):
    # named groups hold the numbers as text; convert them so JSON gets numbers
    entries.append({"id": i,
                    "silence_start": float(matched.group('g2')),
                    "silence_end": float(matched.group('g3')),
                    "silence_duration": float(matched.group('g4'))})

print(json.dumps(entries))

(1) You might want to pass some flags like "re.IGNORECASE" to make your script immune to changes in letter case.

(2) I prefer non-greedy quantifiers; the choice can affect both what gets matched and how fast. Using named groups is a matter of personal taste. They are useful if you decide to reformat the read() text in one go with matcher.sub instead of iterating to rebuild it; I could add the replacement string if you can't figure it out. Otherwise I prefer the .group method of the match object, which is made for this and accepts whatever names you choose instead of g1, g2, g3, g4.

Overall I prefer finditer, as it is basically made for this kind of operation. findall returns tuples of the captured groups, which is nice, but sometimes you want information about the complete match: the pattern, the position in the analyzed text, and so on.

Edit: I made the regex robust to any string added after duration figures, and to multiple spaces. I also take the intercalated lines into account, you can capture them by naming the group if you want. It captures 189 occurrences, there are 190 "silence start" but the last one is not followed by end and duration info.
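To illustrate the finditer point, here is a small sketch. It uses a deliberately simplified pattern (not the one above) with made-up group names `start`, `end` and `duration`, and a three-line sample taken from the line quoted in the comments below. Because of re.DOTALL, the `.*?` can run across lines, so a silence_start without its own end would be paired with a later silence_end; treat it as a demo of the API rather than a robust parser.

```python
import re

# Sample assembled from a line quoted in the comments of this thread.
SAMPLE = (
    "[silencedetect @ 0x7fe7a4f00ac0] silence_start: 732.925\n"
    "frame=22123 fps=1001 q=-0.0 size=N/A time=00:12:18.17 bitrate=N/A speed=33.4x\n"
    "[silencedetect @ 0x7fe7a4f00ac0] silence_end: 738.673 | silence_duration: 5.74771\n"
)

# Simplified, hypothetical pattern: IGNORECASE tolerates case changes,
# DOTALL lets .*? skip over an intercalated frame=... line.
PATTERN = re.compile(
    r"silence_start:\s+(?P<start>-?\d+\.\d+)"
    r".*?"
    r"silence_end:\s+(?P<end>-?\d+\.\d+)"
    r"\s+\|\s+silence_duration:\s+(?P<duration>-?\d+\.\d+)",
    re.IGNORECASE | re.DOTALL,
)

matches = list(PATTERN.finditer(SAMPLE))
for match in matches:
    # groupdict() maps each group name to its captured text
    values = {name: float(text) for name, text in match.groupdict().items()}
    # start()/end() give the position of the whole match in the input,
    # which findall does not expose
    print(match.start(), match.end(), values)
```

With findall you would only get the tuple of three captured strings per match, without the positions.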

you've done what I did at first. notice that there is a line that looks like `[silencedetect @ 0x7fe7a4f00ac0] silence_start: 732.925 frame=22123 fps=1001 q=-0.0 size=N/A time=00:12:18.17 bitrate=N/A speed=33.4x [silencedetect @ 0x7fe7a4f00ac0] silence_end: 738.673 | silence_duration: 5.74771` which i don't think you match? — Stael, Jul 05 '17 at 09:10
also, even after a few syntax errors (there is a ' out of place and you forgot to \ one of your [) i get `bad character in backref group name ''` — Stael, Jul 05 '17 at 09:13
I did not take your post as an example; it actually took me a long time to write mine correctly, because I am more used to asking questions than answering them here. You have a relevant point about this line, I can't match it. But anyway I preferred starting my regex with the silencedetect part for consistency. I will edit my post again when I am able to, to take the supplementary characters into account. As for the errors, it seems I didn't copy the right expression from my clipboard. Silly me. — Ando Jurai, Jul 05 '17 at 09:14
:) don't worry about it, we all make errors - and I was actually saying you made the same error (separately) as the one I made. Nice to hear of people moving from asking to answering - I mostly answer questions because it forces me to learn things, so I hope you enjoy it. — Stael, Jul 05 '17 at 09:17
I do. And actually, I am not planning to move from my asking scheme, just that I know enough of regex to usually sport myself not badly like this;) — Ando Jurai, Jul 05 '17 at 09:38
