How can I extract a sentence in a file?

Question

I'm trying to extract a sentence from a .json file, but the same words appear many times, so it's kind of difficult.

For example:

"inferText":"this is content that I want to extract\nblahblah...\n","inferConfidence":0.99949986,"subFields":[
{"boundingPoly":{"vertices":[{"x":133.0,"y":250.0},{"x":237.0,"y":250.0}
{"x":237.0,"y":284.0},{"x":133.0,"y":284.0}]},"inferText":"blahblah","inferConfidence":0.9994,"lineBreak":false},
{"boundingPoly":{"vertices":[{"x":244.0,"y":251.0},{"x":322.0,"y":251.0}, .....

I tried this code, but there are so many "inferText" and "inferConfidence" entries:

    infer = re.findall(r'inferText.+inferConfidence', readline)

How can I solve this? Help!

Is the sentence always the value of a particular key? Can't you parse the JSON as JSON (import json) and then traverse the object until you find the "inferText" key and return its value? — saquintes, Aug 02 '21 at 07:32

score 1 · Accepted Answer · answered Aug 02 '21 at 10:37

You can use regular expressions to match a pattern and to extract part of what the pattern matched.

Using a capturing group, you can extract what follows "inferText" with the pattern "inferText":"(\w*)"

Note: in Python regex, \w matches word characters (letters, digits, and the underscore). The documentation adds:

Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depends on whether ASCII or LOCALE mode is in force.

This can be tested using this code:

import re

pattern = r'"inferText":"(\w*)"'  # raw string, so \w reaches the regex engine unchanged
string = '"inferConfidence":0.99949986,"subFields":[{"boundingPoly":{"vertices":[{"x":133.0,"y":250.0},{"x":237.0,"y":250.0}{"x":237.0,"y":284.0},{"x":133.0,"y":284.0}]},"inferText":"blahblah","inferConfidence":0.9994,"lineBreak":false},{"boundingPoly":{"vertices":[{"x":244.0,"y":251.0},{"x":322.0,"y":251.0}, .....'

sentences_re = re.compile(pattern)
sentences = sentences_re.findall(string)
print(sentences)

Outputting:

['blahblah']

Thanks for your help! It works, but I want to include special characters such as ".", "[", "]" in the string. — SeungWonBang, Aug 03 '21 at 05:22
Ok, so you just have to modify the regex pattern to `"inferText":"([\.\[\]\w]*)"`. You can test your regexes [here](https://regex101.com/). — Marte Valerio Falcone, Aug 03 '21 at 12:09
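
If the text can contain arbitrary punctuation, an alternative to listing every allowed character is to capture everything up to the closing quote. The following is only a minimal sketch, assuming the values are ordinary JSON strings; the sample string and the name `INFER_TEXT_RE` are made up for illustration and are not part of the answer above:

```python
import re

# Capture any run of characters that are neither a quote nor a backslash,
# or a backslash followed by one character (an escape such as \n or \").
INFER_TEXT_RE = re.compile(r'"inferText":"((?:[^"\\]|\\.)*)"')

# Hypothetical sample in the same shape as the data in the question.
sample = (
    '{"inferText":"see [fig. 1].","inferConfidence":0.99},'
    '{"inferText":"blah-blah","inferConfidence":0.98}'
)

texts = INFER_TEXT_RE.findall(sample)
print(texts)
```

Note that the captured text is still in its escaped form: a `\n` in the file comes back as a backslash followed by `n`, not as a newline. Parsing with the json module avoids that.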

score 0 · Answer 2 · answered Aug 02 '21 at 07:32

Given that you are working with JSON content, you should be relying on Python's native json library, rather than pure regex:

import json

inp = "{\"inferText\":\"this is content that I want to extract blahblah...\",\"inferConfidence\":0.99949986,\"subFields\":[{\"boundingPoly\":{\"vertices\":[{\"x\":133.0,\"y\":250.0},{\"x\":237.0,\"y\":250.0},{\"x\":237.0,\"y\":284.0},{\"x\":133.0,\"y\":284.0}]}}]}"
obj = json.loads(inp)
print(obj["inferText"])

This prints:

this is content that I want to extract blahblah...
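
Since "inferText" also appears inside the nested "subFields" entries, one option is to walk the parsed object and collect every value stored under that key. This is only a sketch: the helper name `collect_infer_text` and the shortened sample document are hypothetical, and it assumes the real file has the nesting shown in the question.

```python
import json

def collect_infer_text(node, key="inferText"):
    """Return every value stored under `key`, at any nesting depth."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            else:
                found.extend(collect_infer_text(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_infer_text(item, key))
    return found

# Hypothetical, shortened document in the shape shown in the question.
raw = (
    '{"inferText":"full sentence","subFields":['
    '{"inferText":"full","lineBreak":false},'
    '{"inferText":"sentence","lineBreak":true}]}'
)

doc = json.loads(raw)
all_texts = collect_infer_text(doc)
print(all_texts)     # every inferText value, outer field first
print(all_texts[0])  # the top-level sentence only
```

When reading from a file, `json.load(f)` on an open file object can take the place of `json.loads(raw)`.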
