I'm working on a project that scrapes websites and outputs each page's HTML as JSON. The only parts of that JSON that are useful to us are the "form" elements.
I wanted to filter for them, but the native array filter only works when I know where the element sits relative to the entire page (the DOM?), and that won't always be the case. I also fear that checking every object's values until I reach the desired one isn't viable, because:
- some pages are humongous,
- "form" also appears as a string in other places that we don't want.
This is in Node.js.
Snippet of input:
[
{
"type": "element",
"tagName": "p",
"attributes": [],
"children": [
{
"type": "text",
"content": "This is how the HTML code above will be displayed in a browser:"
}
]
},
{
"type": "text",
"content": "\n"
},
{
"type": "element",
"tagName": "form",
"attributes": [
{
"key": "action",
"value": "/action_page.php"
},
{
"key": "target",
"value": "_blank"
}
],
"children": [
{
"type": "text",
"content": "\nFirst name:"
},
{
"type": "element",
"tagName": "br",
"attributes": [],
"children": []
},
{
"type": "text",
"content": "\n"
},
{
"type": "element",
"tagName": "input",
"attributes": [
{
"key": "type",
"value": "text"
},
{
"key": "name",
"value": "firstname0"
},
{
"key": "value",
"value": "John"
}
],
"children": []
},
{
"type": "element",
"tagName": "br",
"attributes": [],
"children": []
},
{
"type": "text",
"content": "\nLast name:"
},
{
"type": "element",
"tagName": "br",
"attributes": [],
"children": []
},
{
"type": "text",
"content": "\n"
},
{
"type": "element",
"tagName": "input",
"attributes": [
{
"key": "type",
"value": "text"
},
{
"key": "name",
"value": "lastname0"
},
{
"key": "value",
"value": "Doe"
}
],
"children": []
},
{
"type": "text",
"content": "\n"
},
{
"type": "element",
"tagName": "br",
"attributes": [],
"children": []
},
{
"type": "element",
"tagName": "br",
"attributes": [],
"children": []
},
{
"type": "text",
"content": "\n"
},
{
"type": "element",
"tagName": "input",
"attributes": [
{
"key": "type",
"value": "submit"
},
{
"key": "value",
"value": "Submit"
}
],
"children": []
},
{
"type": "text",
"content": "\n"
},
{
"type": "element",
"tagName": "input",
"attributes": [
{
"key": "type",
"value": "reset"
}
],
"children": []
},
{
"type": "text",
"content": "\n"
}
]
},
{
"type": "text",
"content": "\n"
}
]
Snippet of the desired output:
[
{
"type": "element",
"tagName": "form",
"attributes": [
{
"key": "action",
"value": "/action_page.php"
},
{
"key": "target",
"value": "_blank"
}
],
"children": [
{
"type": "text",
"content": "\nFirst name:"
},
{
"type": "element",
"tagName": "br",
"attributes": [],
"children": []
},
{
"type": "text",
"content": "\n"
},
{
"type": "element",
"tagName": "input",
"attributes": [
{
"key": "type",
"value": "text"
},
{
"key": "name",
"value": "firstname0"
},
{
"key": "value",
"value": "John"
}
],
"children": []
},
{
"type": "element",
"tagName": "br",
"attributes": [],
"children": []
},
{
"type": "text",
"content": "\nLast name:"
},
{
"type": "element",
"tagName": "br",
"attributes": [],
"children": []
},
{
"type": "text",
"content": "\n"
},
{
"type": "element",
"tagName": "input",
"attributes": [
{
"key": "type",
"value": "text"
},
{
"key": "name",
"value": "lastname0"
},
{
"key": "value",
"value": "Doe"
}
],
"children": []
},
{
"type": "text",
"content": "\n"
},
{
"type": "element",
"tagName": "br",
"attributes": [],
"children": []
},
{
"type": "element",
"tagName": "br",
"attributes": [],
"children": []
},
{
"type": "text",
"content": "\n"
},
{
"type": "element",
"tagName": "input",
"attributes": [
{
"key": "type",
"value": "submit"
},
{
"key": "value",
"value": "Submit"
}
],
"children": []
},
{
"type": "text",
"content": "\n"
},
{
"type": "element",
"tagName": "input",
"attributes": [
{
"key": "type",
"value": "reset"
}
],
"children": []
},
{
"type": "text",
"content": "\n"
}
]
}
]
TL;DR: retain only the forms and all of their children.
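
To make the desired behaviour concrete, here is a minimal sketch of one way it could be done: a recursive walk that visits each node once and keeps any element whose tagName is "form", together with its whole subtree. The function name collectForms and the sample page below are hypothetical, and the sketch assumes every node has the same shape as in the input snippet ("type", "tagName", "children"). Because it compares tagName only on nodes of type "element", a text node whose content happens to be "form" is not matched.

```javascript
// Hypothetical helper: walk the parsed tree and collect every <form> element,
// at any depth, in document order. Each node is visited at most once.
function collectForms(nodes, found = []) {
  for (const node of nodes) {
    // Text nodes have no tagName or children, so skip them.
    if (node.type !== "element") continue;

    if (node.tagName === "form") {
      // Keep the form as-is; its children come along with it.
      found.push(node);
    } else if (Array.isArray(node.children)) {
      // Not a form: look for forms nested further down.
      collectForms(node.children, found);
    }
  }
  return found;
}

// Made-up sample: a form nested inside a div, plus a text node containing "form".
const page = [
  { type: "text", content: "form" },
  {
    type: "element",
    tagName: "div",
    attributes: [],
    children: [
      {
        type: "element",
        tagName: "form",
        attributes: [{ key: "action", value: "/action_page.php" }],
        children: [
          { type: "element", tagName: "input", attributes: [], children: [] }
        ]
      }
    ]
  }
];

const forms = collectForms(page);
console.log(JSON.stringify(forms, null, 2));
```

The walk does not descend into a form once it is found, so the cost is at most one visit per node in the page. If very deep nesting is a concern, the same idea could be written with an explicit stack instead of recursion.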