Assuming your data is ordered, you can simply stream-parse it line by line; there's no need for regex or for loading the whole file into memory:
import json

parsed = []  # a list to hold our parsed values
with open("entries.dat", "r") as f:  # open the file for reading
    current_id = 1  # holds our ID
    entry = None  # holds the current parsed entry
    for line in f:  # ... go through the file line by line
        if line[:14] == "[silencedetect":  # parse the lines starting with [silencedetect
            if entry:  # we already picked up silence_start
                index = line.find("silence_end:")  # find where silence_end starts
                value = line[index + 12:line.find("|", index)].strip()  # the number after it
                entry["silence_end"] = float(value)  # store the silence_end
                # the following step is optional, instead of parsing you can just calculate
                # the silence_duration yourself with:
                # entry["silence_duration"] = entry["silence_end"] - entry["silence_start"]
                index = line.find("silence_duration:")  # find where silence_duration starts
                value = line[index + 17:].strip()  # grab the number after it
                entry["silence_duration"] = float(value)  # store the silence_duration
                # and now that we have everything...
                parsed.append(entry)  # add the entry to our parsed list
                entry = None  # blank out the entry for the next step
            else:  # find silence_start first
                index = line.find("silence_start:")  # find where silence_start, well, starts
                value = line[index + 14:].strip()  # grab the number after it
                entry = {"id": current_id}  # store the current ID...
                entry["silence_start"] = float(value)  # ... and the silence_start
                current_id += 1  # increase our ID value for the next entry

# Now that we have our data, we can easily turn it into JSON and print it out if needed
your_json = json.dumps(parsed, indent=4)  # holds the JSON, pretty-printed
print(your_json)  # let's print it...
And you get:
[
    {
        "silence_end": 2.2059,
        "silence_duration": 2.21828,
        "id": 1,
        "silence_start": -0.012381
    },
    {
        "silence_end": 6.91955,
        "silence_duration": 1.12694,
        "id": 2,
        "silence_start": 5.79261
    },
    {
        "silence_end": 9.12544,
        "silence_duration": 0.59288,
        "id": 3,
        "silence_start": 8.53256
    },
    {
        "silence_end": 10.7276,
        "silence_duration": 1.0805,
        "id": 4,
        "silence_start": 9.64712
    },
    #
    # etc.
    #
    {
        "silence_end": 795.516,
        "silence_duration": 0.68576,
        "id": 189,
        "silence_start": 794.83
    }
]
Keep in mind that JSON doesn't prescribe any ordering of object keys (and neither did Python's dict before v3.7, where insertion order became a language guarantee; CPython 3.6 already preserved it as an implementation detail), so the id won't necessarily appear first, but the data is just as valid. I've purposefully separated the initial entry creation so you can use collections.OrderedDict as a drop-in replacement (i.e. entry = collections.OrderedDict({"id": current_id})) to preserve the order if that's what you wish.
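For illustration, here is a minimal sketch of that OrderedDict variant, wrapped in a function so it can be fed any iterable of lines (an open file works just as well). The function name parse_silences and the two sample lines are my own assumptions, modeled on the line layout the code above expects; the address after the @ is a placeholder. It also computes silence_duration from the two timestamps instead of parsing it, as mentioned in the comments of the original code.

```python
import json
from collections import OrderedDict


def parse_silences(lines):
    """Collect silence_start/silence_end pairs from an iterable of lines.

    Hypothetical helper: assumes the lines are ordered, i.e. every
    silence_start line is followed by its matching silence_end line.
    """
    parsed = []
    entry = None
    for line in lines:
        if not line.startswith("[silencedetect"):
            continue  # skip anything that is not silencedetect output
        if entry is None:
            # first line of a pair: create the entry with "id" as its first key
            start = line.split("silence_start:", 1)[1].strip()
            entry = OrderedDict([("id", len(parsed) + 1)])
            entry["silence_start"] = float(start)
        else:
            # second line of a pair: the number sits between the label and "|"
            end = line.split("silence_end:", 1)[1].split("|", 1)[0].strip()
            entry["silence_end"] = float(end)
            # calculated rather than parsed, so expect tiny float rounding differences
            entry["silence_duration"] = entry["silence_end"] - entry["silence_start"]
            parsed.append(entry)
            entry = None
    return parsed


# Assumed sample input; the "0x0" address is a placeholder, not real output.
sample = [
    "[silencedetect @ 0x0] silence_start: 5.79261\n",
    "[silencedetect @ 0x0] silence_end: 6.91955 | silence_duration: 1.12694\n",
]
result = parse_silences(sample)
print(json.dumps(result, indent=4))
```

With this variant the keys come out in insertion order (id, silence_start, silence_end, silence_duration) on any Python version, since json.dumps walks an OrderedDict in order.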