The website I am trying to extract data from is : http://www.genome.jp/dbget-bin/www_bget?ecs:ECs0037

and I am trying to extract the "nt sequence":

try:
    geneSeq = browser.find_element_by_xpath("html/body/div[1]/table/tbody/tr/td/table[2]/tbody/tr/td[1]/form/table/tbody/tr/td/table/tbody/tr[11]/td").text

except:
    geneSeq = "file\nnot found" 
geneSeq = geneSeq[geneSeq.find('\n')+1:]

I remove the first line of the text as I don't need it, but the HTML contains br tags, which end up as line breaks in the output file even though Python does not seem to see them. I have tried .isspace(), which returns False, and therefore .rsplit() does not work. The line breaks still show up when I write the sequence to a file using f.write.

Is there a way to remove the br tag?

CodingDuck
  • You should probably consider a more full-featured web scraper such as [beautiful soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). – Tom Wyllie Jul 18 '17 at 16:21
  • The problem with that is that when I use BeautifulSoup and extract the HTML, the website queries have not been run, so I am not actually seeing the sequences – CodingDuck Jul 18 '17 at 16:58
  • Using an XPath that long is going to be brittle... you should probably spend some time reading some tutorials on CSS selectors and XPath so you can hand craft them. Your XPath can be replaced with `"//th/nobr[.='NT seq']/following::td"`. – JeffC Jul 18 '17 at 18:33

3 Answers

Assuming your HTML string is named `html`, do this:

html = html.replace('<br>', '')

Cory Madden
  • Sorry, I was not explicit enough - the <br> doesn't show up in the text, it just gives me phantom line breaks in my code – CodingDuck Jul 18 '17 at 16:26
  • Oh, I see. Sorry. It seemed like you didn't understand the functionality of the methods you were trying to use, but this makes more sense. Try the accepted answer here: https://stackoverflow.com/questions/3711856/how-to-remove-empty-lines-with-or-without-whitespace-in-python – Cory Madden Jul 18 '17 at 16:28
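
A minimal sketch of the idea in the comment above, assuming the text returned by Selenium's `.text` is already held in a string. The header line is dropped as in the question, then each remaining line is stripped and empty lines are discarded before joining. The function name `join_sequence_lines` and the sample text are made up for illustration; they are not from the page being scraped.

```python
def join_sequence_lines(raw_text):
    """Drop the first line, then join the remaining non-empty lines."""
    # Keep everything after the first newline, as in the question's slicing.
    body = raw_text[raw_text.find('\n') + 1:]
    # splitlines() handles \n, \r\n and other line boundaries; strip()
    # removes surrounding whitespace so blank lines become empty strings.
    lines = (line.strip() for line in body.splitlines())
    return ''.join(line for line in lines if line)


sample = "1245 nt\natgaaa\r\n\n  cgcatt \nagc"  # hypothetical text, not real data
print(join_sequence_lines(sample))
```

Splitting on line boundaries instead of replacing `'\n'` alone also covers `\r\n` line endings, which are one possible cause of the "phantom" breaks.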

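This will print the whole HTML content of a page in Python 2 (in Python 3, urllib2 was merged into urllib.request):

import urllib2  # Python 2 only

req = urllib2.Request('https://www.google.com')
response = urllib2.urlopen(req)
the_page = response.read()  # raw HTML of the page
print(the_page)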

Thank you for all the answers. Because Python was not seeing the space as whitespace, I ended up writing a loop that keeps only alphabetic characters, which seemed to work:

noSpace = ""
for char in geneSeq:
    if char.isalpha():
        noSpace = noSpace + char
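
The same filter can be written more compactly. This is only a sketch, assuming the sequence should contain nothing but letters; `keep_letters` is a hypothetical helper name and the sample string is invented. Note that `isalpha()` also drops digits and punctuation, not just whitespace.

```python
def keep_letters(text):
    """Return only the alphabetic characters of text, in their original order."""
    # str.join over a generator avoids repeated string concatenation.
    return ''.join(char for char in text if char.isalpha())


geneSeq = "atgaaa\ncgc att\u00a0agc\r\n"  # hypothetical sample with mixed whitespace
noSpace = keep_letters(geneSeq)
print(noSpace)
```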
CodingDuck