1

So I have been working a bit with bs4 and managed to print out some text. Right now I can print out the script that starts with var ajaxsearch, which comes with a lot more in it.

I have written code that finds all the script tags containing JavaScript and prints the one whose text starts with var ajaxsearch:

    try:
        product_li_tags = bs4.find_all('script', {'type': 'text/javascript'})
    except Exception:
        product_li_tags = []

    special_code = ''
    for s in product_li_tags:
        if s.text.strip().startswith('var ajaxsearch'):
            special_code = s.text
            break

    print(special_code)

and I am getting an output of:

var ajaxsearch = false;
var combinationsFromController ={
  "224114": {
    "attributes_values": {
      "4": "5.5"
    },
    "attributes": [
      22
    ],

    "unit_impact": 0,
    "minimal_quantity": "1",
    "date_formatted": "",
    "available_date": "",
    "id_image": -1,
    "list": "'22'"
  },
  "224140": {
    "attributes_values": {
      "4": "6"
    },
    "attributes": [
      23
    ],
    "unit_impact": 0,
    "minimal_quantity": "1",
    "date_formatted": "",
    "available_date": "",
    "id_image": -1,
    "list": "'23'"
  },
  "224160": {
    "attributes_values": {
      "4": "6.5"
    },
    "attributes": [
      24
    ],
    "unit_impact": 0,
    "minimal_quantity": "1",
    "date_formatted": "",
    "available_date": "",
    "id_image": -1,
    "list": "'24'"
  },
  "224139": {
    "attributes_values": {
      "4": "7"
    },
    "attributes": [
      25
    ],
    "unit_impact": 0,
    "minimal_quantity": "1",
    "date_formatted": "",
    "available_date": "",
    "id_image": -1,
    "list": "'25'"
  },
  "224138": {
    "attributes_values": {
      "4": "7.5"
    },
    "attributes": [
      26
    ],
    "unit_impact": 0,
    "minimal_quantity": "1",
    "date_formatted": "",
    "available_date": "",
    "id_image": -1,
    "list": "'26'"
  },
  "224113": {
    "attributes_values": {
      "4": "8"
    },
    "attributes": [
      27
    ],
    "unit_impact": 0,
    "minimal_quantity": "1",
    "date_formatted": "",
    "available_date": "",
    "id_image": -1,
    "list": "'27'"
  },
  "224129": {
    "attributes_values": {
      "4": "8.5"
    },
    "attributes": [
      28
    ],
    "unit_impact": 0,
    "minimal_quantity": "1",
    "date_formatted": "",
    "available_date": "",
    "id_image": -1,
    "list": "'28'"
  },
  "224161": {
    "attributes_values": {
      "4": "9"
    },
    "attributes": [
      29
    ],
    "unit_impact": 0,
    "minimal_quantity": "1",
    "date_formatted": "",
    "available_date": "",
    "id_image": -1,
    "list": "'29'"
  }
};
var contentOnly = false;
var Blank = 1;
var Format = 2;

This means that when I print s.text I get the output above. Small edit: if I try if s.text.strip().startswith('var combinationsFromController'): it won't find the value, and if I change it to if 'var combinationsFromController' in s.text.strip(): it prints the same output as above.

However, my issue is that I only want to print out var combinationsFromController and skip the rest, so that later on I can convert the value to JSON using json.loads. Before that, though: how can I print just the value of var combinationsFromController?

EDIT: probably solved it!

    import json

    for s in product_li_tags:
        if 'var combinationsFromController' in s.text.strip():
            for line in s.text.splitlines():
                if line.startswith('var combinationsFromController'):
                    get_full_text = line.strip()
                    # Split "var name = {...};" once, into the name and the value
                    get_config = get_full_text.split(" = ", 1)
                    # Drop the trailing ';' so that the rest is valid JSON
                    cut_text = get_config[1][:-1]
                    get_json_values = json.loads(cut_text)

Hellosiroverthere
  • If your goal is to parse JavaScript in Python, you can consider using an existing library that does that. A number of them are mentioned in this post: https://stackoverflow.com/questions/390992/javascript-parser-in-python – rohit-biswas Dec 18 '18 at 11:09

2 Answers

1

If I understand your question correctly, you have a string of 121 lines representing 5 JavaScript variables, and you want to obtain a substring containing only the second variable.

You can use Python string manipulation as follows:

lines = special_code.split('\n')
start = lines.index('var combinationsFromController ={')
# Search for the next variable only after the start line; the result is an absolute index
end = lines.index('var contentOnly = false;', start + 1)
print('\n'.join(lines[start:end]))

This uses list.index to find the line where the JavaScript variable you need starts and the line where the next one begins. If the order of the variables is arbitrary, i.e. you don't know the name of the variable that follows the target one, you can still use similar string manipulation to obtain the required substring.

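lines = special_code.split('\n')
start = lines.index('var combinationsFromController ={')
end = len(lines) - 1  # default: the target is the last variable in the script
for i, line in enumerate(lines[start + 1:]):
    if line.startswith('var '):
        # The slice starts at start + 1, so start + i is the line just before this one
        end = start + i
        break
print('\n'.join(lines[start:end + 1]))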
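
If the value is plain JSON, as it is in the output shown in the question, a less whitespace-sensitive variant is to let the json module find the end of the object. This is only a sketch: the helper name extract_js_object and the shortened sample string are made up for illustration, and it will fail on JavaScript that is not valid JSON (single quotes, unquoted keys, comments).

```python
import json

# Hypothetical, shortened stand-in for the script text held in special_code.
special_code = '''var ajaxsearch = false;
var combinationsFromController ={
  "224114": {"attributes_values": {"4": "5.5"}, "attributes": [22]},
  "224140": {"attributes_values": {"4": "6"}, "attributes": [23]}
};
var contentOnly = false;
'''


def extract_js_object(source, var_name):
    """Return the JSON object assigned to `var_name` in `source` (illustrative helper)."""
    name_pos = source.index('var ' + var_name)  # raises ValueError if the variable is absent
    brace_pos = source.index('{', name_pos)     # first '{' after the variable name
    # raw_decode parses exactly one JSON value starting at brace_pos and
    # ignores whatever follows it (the trailing ';' and the later variables).
    obj, _end = json.JSONDecoder().raw_decode(source, brace_pos)
    return obj


combinations = extract_js_object(special_code, 'combinationsFromController')
print(list(combinations))
```

Because the lookup is by variable name and the decoder decides where the object ends, it does not matter whether the page writes `={` or `= {`, or which variable comes next.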
  • Note that parsing JavaScript code (or any language with possibly nested statements) with such simple solutions is brittle at best - it's usually better to use a proper parser that won't trip on technically equivalent but textually different code (i.e. if a space is added between the '=' and the '{' in the first line). – bruno desthuilliers Dec 18 '18 at 11:56
  • Oh yeah, that's very true! So meanwhile I tried something different and it apparently worked! I edited my question. What do you think about that? – Hellosiroverthere Dec 18 '18 at 12:26
  • @brunodesthuilliers you are totally right. Proper javascript parser is the best choice. – Alessandro Solbiati Dec 18 '18 at 12:50
  • @Hellosiroverthere, yes, your edit might work as well. I would argue it is not that clean; maybe a two-line solution like my answer or the one using a regular expression is cleaner. But you should definitely check out a JavaScript parser, as mentioned in the other comments. – Alessandro Solbiati Dec 18 '18 at 12:50
  • Yay! Who knew I would be able to do that, but I wouldn't have been able to without you guys' help! – Hellosiroverthere Dec 18 '18 at 12:50
  • Yeah, for sure. I mean, I tried to do it the same way as you did, but I think I got some errors. So I tried to work it out from the point you made and went on from there. But I agree. Maybe I should do that after all. – Hellosiroverthere Dec 18 '18 at 12:57
  • If you want to try a JavaScript parser, I would recommend this: https://pypi.org/project/slimit/ – Alessandro Solbiati Dec 18 '18 at 13:02
  • Whoa! I just tried slimit! Seems to work as well. An error I got from yours was `ValueError: 'var combinationsFromController ={' is not in list` – Hellosiroverthere Dec 18 '18 at 13:09
1

Use re with the expression (\{.*?\}); to capture the object between var combinationsFromController = and the closing ; that comes before var contentOnly = false;

import json
import re

....
print(special_code)
jsonStr = re.search(r'(\{.*?\});', special_code, re.S).group(1)
combinationsFromController = json.loads(jsonStr)

for key in combinationsFromController:
    print(key)
    # 224114
    # 224140
    # 224160
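
One thing to keep in mind: the pattern above captures the first {...}; in the script, which happens to be the right one here. If another object could be assigned earlier in the same script, the pattern can be anchored on the variable name. The sketch below assumes that case; page_script, its settings variable, and the helper find_assigned_object are hypothetical names used only for illustration.

```python
import json
import re

# Hypothetical script text in which another object comes before the target one.
page_script = '''var settings = {"debug": false};
var combinationsFromController = {
  "224114": {"attributes_values": {"4": "5.5"}, "attributes": [22]},
  "224140": {"attributes_values": {"4": "6"}, "attributes": [23]}
};
var contentOnly = false;
'''


def find_assigned_object(source, var_name):
    """Parse the object literal assigned to `var_name`, or return None if it is absent."""
    # \s* around '=' tolerates both "name ={" and "name = {".
    pattern = r'var\s+' + re.escape(var_name) + r'\s*=\s*(\{.*?\})\s*;'
    match = re.search(pattern, source, re.S)
    if match is None:
        return None
    return json.loads(match.group(1))


# The unanchored pattern picks up the first object in the script instead.
first_object = re.search(r'(\{.*?\});', page_script, re.S).group(1)
print(first_object)

combinations = find_assigned_object(page_script, 'combinationsFromController')
for key in combinations:
    print(key)
```

Like the original pattern, this stops at the first } that is followed by ;, so it would still break if a string value inside the object contained "};".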
ewwink