How to use regex scraping json values

Question

I have been trying to learn regex and once again I got stuck.

What I am trying to scrape is a value of:

var preloadedItems = [
{
  "id": "8971",
  "permalink": "https://www.randomsite1.com"
},
{
  "id": "8943",
  "permalink": "https://www.randomsit2e.com"
},
{
  "id": "8944",
  "permalink": "https://www.randoms3ite.com"
},
{
  "id": "8950",
  "permalink": "https://www.random4site.com"
},
{
  "id": "8910",
  "permalink": "https://www.random5site.com"
},
{
  "id": "8915",
  "permalink": "https://www.rando6msite.com"
}
];

#The code is pretty long so I have not posted everything here.

which I get by doing

p = re.compile(r'var preloadedItems = \[(.*?)\];', re.DOTALL)
data = p.findall(req.text)[0]

which returns me the whole value of the json I posted. However I want to scrape only all permalink into a list and I tried to do

p = re.compile(r'var preloadedItems = \[(.*?)\];', re.DOTALL)
data = json.loads(p.findall(r.text)[0]).items()

but I do get an error of Extra data: line 1 column 2657 (char 2656)

and I wonder how I am able to scrape all permalinks into a list?

Update:

My thought was to scrape the json value first using regex to be able to use it later on as json.loads(regexValue) - Meaning thaht I use regex to grab the value Regexjson = {....} and after that using json.loads(Regexjson)...

What are you trying to match? There is no `var preloadedItems` in that json and why are you `json.loads`ing a regex findall result? — Error - Syntactical Remorse, Aug 05 '19 at 17:31
Woopsie! I forgot to add that ouchh! Will add that right away — Thrillofit86, Aug 05 '19 at 17:33
Regex is not suitable tool for such cases, simply use a JSON parser and get values — Code Maniac, Aug 05 '19 at 17:33
Hmm but I do need to scrape the value first and then use it as JSON parser? No? My plan was to use regex to be able to scrape the JSON value and then use JSON Parser. — Thrillofit86, Aug 05 '19 at 17:35
@Thrillofit86 Your edit makes the question much different than before. Are you trying to scrape a JavaScript file? — Error - Syntactical Remorse, Aug 05 '19 at 17:36
Alright so what I tried to do at first was to scrape the json value using regex `var preloadedItems = \[(.*?)\];` where I after that use the json.loads(regexValue) to be able to do whatever I want with the json values. @Error-SyntacticalRemorse — Thrillofit86, Aug 05 '19 at 17:37

score 1 · Accepted Answer · answered Aug 05 '19 at 17:47

I needed to move your regex grouping (( )) to get this to work. I also switched findall(...) to search(...) assuming there is only one entry you are extracting.

import re
import json

with open('test.txt', 'r') as f:
    text = f.read() # Getting your text from a make shift file

p = re.compile(r'var preloadedItems = (\[.*?\]);', re.DOTALL)
data = p.search(text)
if data:
    json_output = json.loads(data[1])
    print(json.dumps(json_output, indent=2))

Output:

[
  {
    "id": "8971",
    "permalink": "https://www.randomsite1.com"
  },
  {
    "id": "8943",
    "permalink": "https://www.randomsit2e.com"
  },
  {
    "id": "8944",
    "permalink": "https://www.randoms3ite.com"
  },
  {
    "id": "8950",
    "permalink": "https://www.random4site.com"
  },
  {
    "id": "8910",
    "permalink": "https://www.random5site.com"
  },
  {
    "id": "8915",
    "permalink": "https://www.rando6msite.com"
  }
]

How to use regex scraping json values

1 Answers1