How to automate the retrieving process from a website

Question

Here is a biological database: http://www.genecards.org/index.php?path=/GeneDecks Usually, if I type in a gene name (a string, e.g. TF53) and submit it, the result comes back on the web page. Users can also choose to save it as a tab-delimited or XML file. However, I have a list containing thousands of gene names. How can I automate this series of steps with a Java program?

I know this question is quite broad and there are probably various ways to do it. With only a little experience in Java programming, I would really appreciate it if someone could suggest an easy way to do it. Thanks.

score 0 · Answer 1 · answered May 20 '14 at 07:50


One possibility is to read the gene names sequentially from your list and, for each one, send a request like this:

http://www.genecards.org/index.php?path=/GeneDecks/ParalogHunter/<your gene name>/100/{%22Sequence_Paralogs%22:%221%22,%22Domains%22:%221%22,%22Super_Pathways%22:%221%22,%22Expression_Patterns%22:%221%22,%22Phenotypes%22:%221%22,%22Compounds%22:%221%22,%22Disorders%22:%221%22,%22Gene_Ontologies%22:%221%22}

(so basically mimic what the site does).

For example:

http://www.genecards.org/index.php?path=/GeneDecks/ParalogHunter/TNFRSF10B/100/{%22Sequence_Paralogs%22:%221%22,%22Domains%22:%221%22,%22Super_Pathways%22:%221%22,%22Expression_Patterns%22:%221%22,%22Phenotypes%22:%221%22,%22Compounds%22:%221%22,%22Disorders%22:%221%22,%22Gene_Ontologies%22:%221%22}
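
In Java, building that request for each gene name could look roughly like the sketch below. The class and method names (`GeneUrlBuilder`, `buildUrl`, `buildUrls`) are made up for illustration; the prefix and the option string are copied from the example URL above. It only constructs the URLs and does not send anything, and it assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class GeneUrlBuilder {
    // Fixed parts of the request, copied from the example URL above.
    static final String PREFIX =
            "http://www.genecards.org/index.php?path=/GeneDecks/ParalogHunter/";
    static final String SUFFIX = "/100/{"
            + "%22Sequence_Paralogs%22:%221%22,"
            + "%22Domains%22:%221%22,"
            + "%22Super_Pathways%22:%221%22,"
            + "%22Expression_Patterns%22:%221%22,"
            + "%22Phenotypes%22:%221%22,"
            + "%22Compounds%22:%221%22,"
            + "%22Disorders%22:%221%22,"
            + "%22Gene_Ontologies%22:%221%22}";

    // Builds the request URL for one gene name.
    // URLEncoder uses form encoding (a space becomes '+'); plain
    // alphanumeric gene symbols pass through unchanged.
    static String buildUrl(String geneName) {
        String encoded = URLEncoder.encode(geneName.trim(), StandardCharsets.UTF_8);
        return PREFIX + encoded + SUFFIX;
    }

    // Builds one URL per non-blank entry of the list, in order.
    static List<String> buildUrls(List<String> geneNames) {
        List<String> urls = new ArrayList<>();
        for (String name : geneNames) {
            if (!name.trim().isEmpty()) {
                urls.add(buildUrl(name));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample gene names, used here only for illustration.
        for (String url : buildUrls(List.of("TNFRSF10B", "BRCA1"))) {
            System.out.println(url);
        }
    }
}
```

Keeping the URL construction separate from the actual download makes it easy to check the generated links by hand before any request is sent.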

However, they might not like people using their site in this way (submitting a lot of automated requests), so you might want to check their policy on that. Another thing to check is whether they have an official API that can be used for batch retrieval of gene information.


Ashalynd


Thanks for your help; this really helps me. However, I don't know exactly how to write this in my Java code. I would really appreciate it if you could demonstrate it a little. I need to save the result in a tab-delimited file. – user3631848 May 21 '14 at 04:32
I am afraid it's actually not legal. Look at their Terms and Conditions, p.26: "Academic User agrees not to use any robots, spiders, crawlers or other automated downloading programs or devices to: (i) continuously and automatically search or index any content, unless authorized by YEDA, WIS or LifeMap; (ii) extract data, content, images from our service; or (iii) cause disruption to the working of the Site." – Ashalynd May 21 '14 at 08:37
Alright, I will send them an email to ask whether we can access their data for academic purposes only. As for the technical issue, we can switch to another site, OMIM (http://www.omim.org/api), which has an API service. I suppose that should make this easier. Could you provide some information about how to do it? Even a pointer to an online resource would be appreciated. Thanks – user3631848 May 21 '14 at 09:19
A simple bash script would suffice for that purpose. There are a lot of examples; for inspiration, see e.g. http://stackoverflow.com/questions/1521462/looping-through-the-content-of-a-file-in-bash You loop over the lines of your file of gene names and call the given link with the gene name as a variable, something like `wget -O "$genename.txt" "your-url-with-$genename"` – Ashalynd May 21 '14 at 09:26
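
Since the question asks for Java, the same loop could be sketched roughly as follows, assuming Java 11 or later for `java.net.http.HttpClient`. Everything named here is an assumption for illustration: the class `GeneBatchDownloader`, its helper methods, the `{gene}` placeholder, and above all `URL_TEMPLATE`, which points at example.org and has to be replaced with the real endpoint of a service you are permitted to query. The one-second pause between requests is likewise an arbitrary choice.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class GeneBatchDownloader {
    // Hypothetical placeholder URL; "{gene}" is replaced by each gene name.
    static final String URL_TEMPLATE = "https://example.org/lookup?gene={gene}";

    // Reads one gene name per line, trimming whitespace and skipping blank lines.
    static List<String> readGeneNames(Path listFile) throws IOException {
        return Files.readAllLines(listFile, StandardCharsets.UTF_8).stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    // Substitutes the URL-encoded gene name into the template.
    static String urlFor(String template, String geneName) {
        return template.replace("{gene}",
                URLEncoder.encode(geneName, StandardCharsets.UTF_8));
    }

    // Sends a GET request and streams the response body into the target file.
    static void download(HttpClient client, String url, Path target)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<Path> response =
                client.send(request, HttpResponse.BodyHandlers.ofFile(target));
        if (response.statusCode() != 200) {
            throw new IOException("HTTP " + response.statusCode() + " for " + url);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: GeneBatchDownloader <gene-list-file> <output-dir>");
            return;
        }
        Path outputDir = Files.createDirectories(Path.of(args[1]));
        HttpClient client = HttpClient.newHttpClient();
        for (String gene : readGeneNames(Path.of(args[0]))) {
            // One output file per gene, named after the gene symbol.
            download(client, urlFor(URL_TEMPLATE, gene), outputDir.resolve(gene + ".txt"));
            Thread.sleep(1000); // pause between requests
        }
    }
}
```

This saves whatever the server returns, one file per gene; turning the responses into a single tab-delimited file would need an extra parsing step that depends on the format the chosen service sends back.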
