
I need help extracting the src values from the text (e.g. LOC/IMG.png). Is there an optimal approach to do this, as I have a file count of over 10^5 files?

I have JSON as follows:

{"Items":[{src=\"LOC/IMG.png\"}]}
Captain Barbossa
  • is this a string or a dictionary? – Avinash Raj Apr 13 '15 at 07:43
  • JSON output stored in files, file count over 1 lakh. – Captain Barbossa Apr 13 '15 at 07:45
  • This isn't "mixed response of HTML and JSON", it's just JSON, some of the members of which are strings that appear to be some form of pre-processed HTML. The right way to parse this would be to parse the JSON, look at the strings that you want to look at, then decode those into HTML fragments, then search those. – abarnert Apr 13 '15 at 07:47
  • Yes, some params have mixed HTML strings in them. The file format is JSON, as mentioned – Captain Barbossa Apr 13 '15 at 07:50
  • Also, what is 1 Lakh? Is it like 173.6 gross of bakers' dozens or something? – abarnert Apr 13 '15 at 07:50
  • @abarnert edited the question. File count is over 10^5. – Captain Barbossa Apr 13 '15 at 07:56
  • 10^5 files isn't that many. How "optimal" does this have to be? If reading each file takes 270ms, and parsing each file adds another 110ms, is that unacceptable? (That's how long `json.loads` on 2K of random JSON plus `bs4.BeautifulSoup` on 2K of random HTML takes on my laptop…) – abarnert Apr 13 '15 at 08:02
  • The file size is large; the above-mentioned file was a sample file. – Captain Barbossa Apr 13 '15 at 08:17

2 Answers


Let me start with a disclaimer for parser advocates: I do not claim regexes are the best tool, and I myself use XML/JSON parsers everywhere I can. However, when it comes to malformed text, parsers usually cannot handle those cases the way I want, and I have to add regex-like code to deal with those situations.

So, in case a regex is absolutely necessary, use the (?<=src=\\").*?(?=\\") regex. The look-behind (?<=src=\\") and look-ahead (?=\\") will act as boundaries for the values inside src attributes.

Here is sample code:

import re

# \\" in the pattern matches a literal backslash-quote, as it appears in the escaped JSON text
p = re.compile(r'(?<=src=\\").*?(?=\\")')
test_str = "YOUR_STRING"
print(re.findall(p, test_str))

See demo.
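As a rough illustration, here is the pattern applied to the escaped JSON sample from the question (read as a raw string, so the backslash-quotes are literal characters):

```python
import re

# Look-behind/look-ahead on the literal \" sequences around the src value
p = re.compile(r'(?<=src=\\").*?(?=\\")')

# The sample line from the question, with literal backslash-quotes
test_str = r'{"Items":[{src=\"LOC/IMG.png\"}]}'

print(re.findall(p, test_str))  # -> ['LOC/IMG.png']
```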

Wiktor Stribiżew
  • Besides the fact that [parsing HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) is already bad, parsing HTML that's been escaped in some unknown way and that's embedded in another format that may contain the exact same patterns you're looking for is even worse. – abarnert Apr 13 '15 at 07:53

You have JSON that contains some values that are HTML. If at all possible, therefore, you should parse the JSON as JSON, then parse the HTML values as HTML. This requires you to understand a tiny bit about the structure of the data—but that's a good thing to understand anyway.

For example:

import json
import bs4

def extract_srcs(s):
    j = json.loads(s)
    for item in j['Items']:
        soup = bs4.BeautifulSoup(item['Item'], 'html.parser')
        for img in soup.find_all('img'):
            yield img['src']

This may be too slow, but it only takes a couple of minutes to write the correct code, run it on 1000 random representative files, then figure out if it will be fast enough when extrapolated to whatever "file count of 1 Lakh" is. If it's fast enough, then do it this way; all else being equal, it's always better to be correct and simple than to be kludgy or complicated, and you'll save time if unexpected data show up as errors right off the bat rather than as incorrect results that you don't notice until a week later…

If your files are about 2K, like your example, my laptop can json.loads 2K of random JSON and BeautifulSoup 2K of random HTML in less time than it takes to read 2K off a hard drive, so at worst this will take only twice as long as reading the data and doing nothing. If you have a slow CPU and a fast SSD, or if your data are very unusual, etc., that may not be true (that's why you test, instead of guessing), but I think you'll be fine.
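The spot-check described above can be sketched roughly like this (the data/*.json path and the sample size of 1000 are placeholders, not from the question; the loop times only the JSON parse, and you'd add the BeautifulSoup step to time the full pipeline):

```python
import glob
import json
import time

# Time the parse on a small sample before committing to all 10^5 files.
# 'data/*.json' is a placeholder; point it at the real directory.
paths = glob.glob('data/*.json')[:1000]

start = time.perf_counter()
for path in paths:
    with open(path) as f:
        json.load(f)  # parse only; add the HTML step to time the full pipeline
elapsed = time.perf_counter() - start

if paths:
    print(f'{elapsed / len(paths) * 1000:.2f} ms per file on average')
else:
    print('no sample files found')
```

Multiplying the per-file time by the total file count gives a rough estimate of whether the simple approach is fast enough before optimizing anything.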

abarnert