I need help extracting the src values from the text (e.g. LOC/IMG.png). Is there an optimal approach, given that I have a file count of over 10^5 files?
I have JSON as follows:
{"Items":[{src=\"LOC/IMG.png\"}]}
Let me put up a disclaimer for parser purists: I do not claim regexes are the coolest, and I myself use XML/JSON parsers wherever I can. However, when it comes to malformed text, parsers usually cannot handle those cases the way I want, and I have to add regex-ish code to deal with those situations.
So, in case a regex is absolutely necessary, use the pattern (?<=src=\\").*?(?=\\"). The look-behind (?<=src=\\") and the look-ahead (?=\\") act as boundaries for the value inside each src attribute; the backslashes are doubled because the \" sequences appear literally in your text.
Here is sample code:
import re

# Note: the ur'' prefix is Python 2; in Python 3 use a plain raw string.
p = re.compile(r'(?<=src=\\").*?(?=\\")')
test_str = "YOUR_STRING"
print(p.findall(test_str))
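For instance, applied to a string shaped like the sample in the question (the literal \" sequences are part of the text, which is why the pattern escapes the backslash):

```python
import re

# Raw text containing literal backslash-escaped quotes, as in the question.
text = r'{"Items":[{src=\"LOC/IMG.png\"},{src=\"LOC/IMG2.png\"}]}'

# Look-behind anchors after src=\" and look-ahead stops at the closing \".
pattern = re.compile(r'(?<=src=\\").*?(?=\\")')
print(pattern.findall(text))  # ['LOC/IMG.png', 'LOC/IMG2.png']
```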
You have JSON that contains some values that are HTML. If at all possible, therefore, you should parse the JSON as JSON, then parse the HTML values as HTML. This requires you to understand a tiny bit about the structure of the data—but that's a good thing to understand anyway.
For example:
import json
import bs4

def extract_srcs(s):
    j = json.loads(s)
    for item in j['Items']:
        # Parse each HTML value with an explicit parser.
        soup = bs4.BeautifulSoup(item['Item'], 'html.parser')
        for img in soup.find_all('img'):
            yield img['src']
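For concreteness, here is the structure that approach assumes, run inline on a minimal well-formed sample (the 'Item' key holding the HTML is an assumption about your data; html.parser is the stdlib parser, so no extra dependency beyond BeautifulSoup):

```python
import json
from bs4 import BeautifulSoup

# Hypothetical well-formed version of the question's data: valid JSON
# whose values are HTML fragments.
sample = '{"Items":[{"Item":"<img src=\\"LOC/IMG.png\\">"}]}'

for item in json.loads(sample)['Items']:
    soup = BeautifulSoup(item['Item'], 'html.parser')
    print([img['src'] for img in soup.find_all('img')])  # ['LOC/IMG.png']
```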
This may be too slow, but it only takes a couple of minutes to write the correct code, run it on 1000 random representative files, and figure out whether it will be fast enough when extrapolated to whatever "a file count of 1 Lakh" is. If it's fast enough, then do it this way; all else being equal, it's always better to be correct and simple than kludgy or complicated, and you'll save time if unexpected data show up as errors right off the bat rather than as incorrect results that you don't notice until a week later.
If your files are about 2K, like your example, my laptop can json.loads
2K of random JSON and BeautifulSoup
2K of random HTML in less time than it takes to read 2K off a hard drive, so at worst this will take only twice as long as reading the data and doing nothing. If you have a slow CPU and a fast SSD, or if your data are very unusual, etc., that may not be true (which is why you test instead of guessing), but I think you'll be fine.
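A quick sketch of that test-then-extrapolate step (the sample generator here is a hypothetical stand-in for reading 1000 of your real files; swap your full parse into the loop):

```python
import json
import random
import string
import time

def make_sample():
    # Hypothetical stand-in for one of your ~2K JSON files.
    name = ''.join(random.choices(string.ascii_lowercase, k=8))
    return json.dumps({"Items": [{"Item": f'<img src="LOC/{name}.png">'}] * 20})

samples = [make_sample() for _ in range(1000)]

start = time.perf_counter()
for s in samples:
    json.loads(s)  # replace with your full JSON + HTML parse
elapsed = time.perf_counter() - start

# Extrapolate from 1,000 representative samples to 10^5 files.
print(f"{elapsed:.3f}s for 1000 files -> ~{elapsed * 100:.1f}s for 10^5")
```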