
I'm trying to extract a sentence from a .json file, but many of the words overlap, so it's difficult.

For example:

"inferText":"this is content that I want to extract\nblahblah...\n","inferConfidence":0.99949986,"subFields":[
{"boundingPoly":{"vertices":[{"x":133.0,"y":250.0},{"x":237.0,"y":250.0}
{"x":237.0,"y":284.0},{"x":133.0,"y":284.0}]},"inferText":"blahblah","inferConfidence":0.9994,"lineBreak":false},
{"boundingPoly":{"vertices":[{"x":244.0,"y":251.0},{"x":322.0,"y":251.0}, .....

I tried this code, but there are so many "inferText" and "inferConfidence" keys:

    infer = re.findall(r'inferText.+inferConfidence', readline)

How can I solve this?

    Is the sentence always the value of a particular key? Can't you parse the JSON as JSON (import json) and then traverse the object until you find the "inferText" key and return its value? – saquintes Aug 02 '21 at 07:32

2 Answers


You can use regular expressions to match a pattern and to extract the part of the text that the pattern matched.

Using a capturing group, you can extract what comes after "inferText" with the pattern "inferText":"(\w*)"

Note: in Python regex, \w matches word characters (letters, digits, and the underscore). The documentation adds:

Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depends on whether ASCII or LOCALE mode is in force.

This can be tested using this code:

    import re

    # Raw string so that \w reaches the regex engine unchanged
    pattern = r'"inferText":"(\w*)"'
    string = '"inferConfidence":0.99949986,"subFields":[{"boundingPoly":{"vertices":[{"x":133.0,"y":250.0},{"x":237.0,"y":250.0}{"x":237.0,"y":284.0},{"x":133.0,"y":284.0}]},"inferText":"blahblah","inferConfidence":0.9994,"lineBreak":false},{"boundingPoly":{"vertices":[{"x":244.0,"y":251.0},{"x":322.0,"y":251.0}, .....'

    # findall returns the contents of the capturing group for every match
    sentences_re = re.compile(pattern)
    sentences = sentences_re.findall(string)
    print(sentences)

Outputting:

['blahblah']
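
Because \w* only matches word characters, a value that contains spaces or punctuation (like the full sentence in the question) would not be captured by the pattern above. Here is a minimal sketch of one way around that, assuming the values are ordinary JSON string literals. The pattern and the sample text are illustrative and not from the original post:

```python
import json
import re

# Hypothetical sample: a fragment shaped like the data in the question.
text = (
    '"inferText":"this is content that I want to extract\\nblahblah...\\n",'
    '"inferConfidence":0.99949986,"subFields":['
    '{"inferText":"blahblah","inferConfidence":0.9994,"lineBreak":false}'
)

# Capture everything between the quotes: either a character that is not a
# quote or backslash, or a backslash followed by one character (an escape).
pattern = re.compile(r'"inferText":"((?:[^"\\]|\\.)*)"')

raw_values = pattern.findall(text)

# Each captured value is still JSON-escaped; re-wrapping it in quotes and
# passing it to json.loads decodes escapes such as \n.
values = [json.loads('"' + raw + '"') for raw in raw_values]
print(values)
```

This still treats the file as plain text, so it can be useful when the input is a truncated or otherwise invalid fragment that a JSON parser would reject.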

Given that you are working with JSON content, you should be relying on Python's native json library, rather than pure regex:

    import json

    # Single quotes around the JSON text avoid having to escape every double quote
    inp = '{"inferText":"this is content that I want to extract blahblah...","inferConfidence":0.99949986,"subFields":[{"boundingPoly":{"vertices":[{"x":133.0,"y":250.0},{"x":237.0,"y":250.0},{"x":237.0,"y":284.0},{"x":133.0,"y":284.0}]}}]}'
    obj = json.loads(inp)
    print(obj["inferText"])

This prints:

this is content that I want to extract blahblah...
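
If "inferText" also appears inside nested objects, as in the question's "subFields" list, the parsed object can be walked recursively, along the lines of the comment under the question. This is only a sketch: the helper name find_values and the sample data are made up for illustration, and the real file may be structured differently.

```python
import json


def find_values(node, key):
    """Yield every value stored under `key`, at any nesting depth."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending in case the value holds nested objects or lists.
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Hypothetical sample shaped like the data in the question.
sample = (
    '{"inferText":"this is content that I want to extract\\nblahblah...\\n",'
    '"inferConfidence":0.99949986,"subFields":['
    '{"inferText":"blahblah","inferConfidence":0.9994,"lineBreak":false}]}'
)

obj = json.loads(sample)
texts = list(find_values(obj, "inferText"))
print(texts[0])
```

Since dictionaries keep their insertion order, the top-level sentence comes first in texts here and the nested "blahblah" follows it.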
Tim Biegeleisen