Get the nodes in an HTML document that contain a word

Question

I want to write a script that checks a document for keywords and identifies the HTML document nodes in which they are contained (possibly assigning each one a unique identifier).

I am not a professional programmer and do not know the strengths of low-level languages or things such as PLO. I am afraid of doing something very bad and unsupported.

How is it possible to isolate the desired nodes?

My experience is with JS and PHP, and PHP only for very simple things. Also, I do not want to rely on JavaScript's ability to work with nodes. My thoughts:

turn the HTML into a string
verify that the word exists on the page
if the word exists on the page: for each node in the body element, get its first and last positions (for example, since we know the position of every character, we can work out the position where each tag is opened and the position where it is closed, and so on for all nodes).

We know the position of the word (e.g. 192 to 199) and check which range it falls into (in this case, the ranges are the nodes of the HTML document).
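A rough sketch of this idea, assuming Python and its standard-library `html.parser` module (neither is named in the question). Instead of computing character ranges by hand, it lets the parser track which elements are currently open and records the innermost one around each text run that contains the keyword. The class name `KeywordLocator`, the helper `find_nodes`, and the running-counter node ids are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class KeywordLocator(HTMLParser):
    """Record the innermost open element around each text run containing a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.stack = []    # (node_id, tag) for the elements that are currently open
        self.next_id = 0   # assumed identifier scheme: a running counter per start tag
        self.hits = []     # (node_id, tag, (line, column)) for each matching text run

    def handle_starttag(self, tag, attrs):
        self.next_id += 1
        if tag not in VOID:
            self.stack.append((self.next_id, tag))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][1] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and self.keyword in data.lower():
            node_id, tag = self.stack[-1]
            self.hits.append((node_id, tag, self.getpos()))


def find_nodes(html, keyword):
    parser = KeywordLocator(keyword)
    parser.feed(html)
    parser.close()
    return parser.hits


sample = ("<html><body><div><p>Hello world</p>"
          "<p>Other <b>world</b> text</p></div></body></html>")
for node_id, tag, pos in find_nodes(sample, "world"):
    print(node_id, tag, pos)
```

The match is a plain case-insensitive substring test, so "world" would also match "worldwide"; a word-boundary regular expression could replace it if whole words are required.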

I need ideas from experienced programmers. It does not matter which language you program in (except web-oriented ones); every opinion is important to me. It is likely that there are libraries that solve such problems. I very much hope that you will understand me. English is not my native language.

score 1 · Answer 1 · edited May 23 '17 at 11:49

You need to use an HTML parser. See:

Which HTML Parser is the best?

After that, use the parser's XPath support to extract whichever nodes you need.
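As a minimal sketch of the parse-then-query idea, assuming Python and well-formed markup (XHTML-like) so that the standard-library `xml.etree.ElementTree` can read it: that module supports only a subset of XPath and has no `contains()`, so the path `.//*` selects every element and the keyword test is done in Python. The helper names `own_text` and `nodes_with_word` are made up for this example.

```python
import xml.etree.ElementTree as ET


def own_text(elem):
    """Text that sits directly inside elem, excluding text inside its children."""
    parts = [elem.text or ""]
    # Text that follows a child's closing tag is stored as that child's tail.
    parts.extend(child.tail or "" for child in elem)
    return "".join(parts)


def nodes_with_word(markup, word):
    root = ET.fromstring(markup)
    # ".//*" selects all descendant elements in document order.
    candidates = [root] + root.findall(".//*")
    return [e for e in candidates if word.lower() in own_text(e).lower()]


sample = ("<html><body><div><p>Hello world</p>"
          "<p>Other <b>world</b> text</p></div></body></html>")
for elem in nodes_with_word(sample, "world"):
    print(elem.tag, repr(own_text(elem)))
```

With a parser that tolerates real-world HTML and supports full XPath 1.0, such as lxml, an expression like `//*[text()[contains(., 'world')]]` expresses the same filter in a single query.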


answered May 13 '13 at 19:29

bjskishore123


score 1 · Accepted Answer · answered May 13 '13 at 19:41

I always recommend Beautiful Soup for this kind of thing. It is a Python library that lets you parse XML/HTML documents really quickly. I would have thought you could quite quickly get something running that extracts the text from each div element. Then, using Python's built-in string manipulation tools, searching for particular words should be fairly simple.
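Something along these lines might be a starting point, assuming the `bs4` package (Beautiful Soup 4) is installed. Rather than looping over div elements only, this sketch searches every text node and reports its parent element, which gives the innermost node containing the word. The function name `elements_containing` is made up for the example.

```python
import re

from bs4 import BeautifulSoup


def elements_containing(html, word):
    """Return the elements whose own text contains `word` as a whole word."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    found = []
    # find_all(string=...) yields the text nodes that match the pattern.
    for text_node in soup.find_all(string=pattern):
        parent = text_node.parent
        # Compare by identity so each element is listed only once.
        if not any(parent is seen for seen in found):
            found.append(parent)
    return found


sample = ("<html><body><div><p>Hello world</p>"
          "<p>Other <b>world</b> text</p></div></body></html>")
for element in elements_containing(sample, "world"):
    print(element.name, element.get_text())
```

Each result is a regular tag object, so a unique identifier could be attached afterwards, for example by setting an attribute such as `element["data-hit"]` (an attribute name chosen here purely for illustration).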
