I have been trying to learn regex and once again I got stuck.

What I am trying to scrape is the value of:

var preloadedItems = [
{
  "id": "8971",
  "permalink": "https://www.randomsite1.com"
},
{
  "id": "8943",
  "permalink": "https://www.randomsit2e.com"
},
{
  "id": "8944",
  "permalink": "https://www.randoms3ite.com"
},
{
  "id": "8950",
  "permalink": "https://www.random4site.com"
},
{
  "id": "8910",
  "permalink": "https://www.random5site.com"
},
{
  "id": "8915",
  "permalink": "https://www.rando6msite.com"
}
];

(The actual data is much longer, so I have not posted all of it here.)

which I get by doing

p = re.compile(r'var preloadedItems = \[(.*?)\];', re.DOTALL)
data = p.findall(req.text)[0]

which returns the whole JSON value I posted above. However, I only want to scrape the permalinks into a list, so I tried

p = re.compile(r'var preloadedItems = \[(.*?)\];', re.DOTALL)
data = json.loads(p.findall(req.text)[0]).items()

but I get the error Extra data: line 1 column 2657 (char 2656)

How can I scrape all the permalinks into a list?


Update:

My idea was to grab the JSON value with regex first so that I can pass it to json.loads afterwards. In other words, I use regex to capture the value, Regexjson = {....}, and then call json.loads(Regexjson).


1 Answer

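I needed to move your regex capture group (the parentheses) so that it includes the square brackets. With the brackets outside the group, the captured text is a comma-separated run of objects rather than a single JSON array, so json.loads parses the first object and then reports Extra data. I also switched findall(...) to search(...), assuming there is only one entry you are extracting.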

import json
import re

# Read the page source from a makeshift file (stands in for req.text)
with open('test.txt', 'r') as f:
    text = f.read()

# The capture group includes the square brackets, so it holds a complete JSON array
p = re.compile(r'var preloadedItems = (\[.*?\]);', re.DOTALL)
data = p.search(text)
if data:
    json_output = json.loads(data.group(1))
    print(json.dumps(json_output, indent=2))

Output:

[
  {
    "id": "8971",
    "permalink": "https://www.randomsite1.com"
  },
  {
    "id": "8943",
    "permalink": "https://www.randomsit2e.com"
  },
  {
    "id": "8944",
    "permalink": "https://www.randoms3ite.com"
  },
  {
    "id": "8950",
    "permalink": "https://www.random4site.com"
  },
  {
    "id": "8910",
    "permalink": "https://www.random5site.com"
  },
  {
    "id": "8915",
    "permalink": "https://www.rando6msite.com"
  }
]
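
To get from the parsed array to the list of permalinks the question asks for, a list comprehension over the parsed items should be enough. The following is a minimal sketch, not a definitive implementation: SAMPLE_PAGE is a shortened, made-up stand-in for req.text, and extract_permalinks is a hypothetical helper name that does not appear in the question.

```python
import json
import re

# Hypothetical stand-in for req.text; a real page would contain far more markup.
SAMPLE_PAGE = """
<script>
var preloadedItems = [
{
  "id": "8971",
  "permalink": "https://www.randomsite1.com"
},
{
  "id": "8943",
  "permalink": "https://www.randomsit2e.com"
}
];
</script>
"""

# The capture group includes the square brackets, so the match is a JSON array.
PRELOADED_RE = re.compile(r'var preloadedItems = (\[.*?\]);', re.DOTALL)


def extract_permalinks(page_text):
    """Return every "permalink" value found in the preloadedItems array."""
    match = PRELOADED_RE.search(page_text)
    if match is None:
        return []  # the variable was not found in the page
    items = json.loads(match.group(1))
    # Skip entries without a "permalink" key instead of raising KeyError.
    return [item["permalink"] for item in items if "permalink" in item]


if __name__ == "__main__":
    print(extract_permalinks(SAMPLE_PAGE))
```

One caveat with this approach: the non-greedy pattern stops at the first ]; it encounters, so it would cut the array short if any string value in the data happened to contain that sequence.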