I am trying to extract information from a particular website and store it in a separate text file. For example, I want to go to http://www.ncbi.nlm.nih.gov/nuccore/293762 and extract the genome sequences. These sequences are formatted as whitespace-separated groups of 10 characters containing only the letters a, t, c, and g; they look something like this: "acctgtacgg".

I've been searching for a solution for hours, but all I find are Java libraries that parse HTML code, such as jsoup. The problem with this is that when I view the source of the website and search for the genome sequences, they don't seem to be included in the source code, although I can find them in the DOM tree. Is there a way to programmatically read the actual data on a web page without downloading the source? Or is there a better way to go about this? Please point me in the right direction; it would be greatly appreciated.
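For the extraction step itself, once the rendered page text has been obtained somehow (e.g. with a headless browser, as suggested in the comments), the 10-character groups can be pulled out with a regular expression. A minimal sketch — the class and method names are made up for illustration, and the sample input only mimics the described format, not the site's actual markup:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SequenceExtractor {
    // Matches a standalone group of exactly 10 bases (a, c, g, t).
    // The word boundaries (\b) reject runs longer than 10 characters.
    private static final Pattern GROUP = Pattern.compile("\\b[acgt]{10}\\b");

    public static List<String> extractGroups(String pageText) {
        List<String> groups = new ArrayList<>();
        Matcher m = GROUP.matcher(pageText);
        while (m.find()) {
            groups.add(m.group());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample text imitating numbered rows of 10-base groups.
        String sample = "1 acctgtacgg ttagcaatcg\n11 ggccaatgca";
        System.out.println(extractGroups(sample));
        // prints: [acctgtacgg, ttagcaatcg, ggccaatgca]
    }
}
```

The matched groups can then be written to a text file with a `java.io.PrintWriter` or `Files.write`.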

DisappointedByUnaccountableMod

terence vaughn
- It would seem that the results are generated by an AJAX call; you would need something that handles the AJAX request and completion and then parses the results... or make the AJAX call yourself... – MadProgrammer Oct 22 '14 at 04:09
- You need something like a headless browser (e.g. `HtmlUnit`) which can load the complete web page for you. There are also some libraries that use Selenium which can do this. – RandomQuestion Oct 22 '14 at 04:10
- `HtmlUnit`: http://htmlunit.sourceforge.net/ – RandomQuestion Oct 22 '14 at 04:10
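Following the HtmlUnit suggestion, the basic usage is to let `WebClient` execute the page's JavaScript so the AJAX-loaded data ends up in the DOM, then read the rendered text rather than the raw source. A rough sketch assuming HtmlUnit is on the classpath (method names are from recent HtmlUnit versions, and the 10-second wait is an arbitrary guess — this requires network access and is untested against the live page):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class NucleotideFetcher {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // NCBI pages trigger script errors that are safe to ignore here.
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage(
                    "http://www.ncbi.nlm.nih.gov/nuccore/293762");
            // Give the background AJAX requests time to complete.
            webClient.waitForBackgroundJavaScript(10_000);
            // Rendered text of the DOM, not the original HTML source.
            String text = page.asText();
            System.out.println(text);
        }
    }
}
```

The resulting text can then be scanned for the 10-character sequence groups and written to a file.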
- See http://stackoverflow.com/questions/260540/how-do-you-scrape-ajax-pages – Neo Oct 22 '14 at 04:18