Remove <br> from element being extracted

Question

The website I am trying to extract data from is: http://www.genome.jp/dbget-bin/www_bget?ecs:ECs0037

and I am trying to extract the "nt sequence":

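try:
    # Absolute XPath to the table cell that holds the nt sequence
    geneSeq = browser.find_element_by_xpath("html/body/div[1]/table/tbody/tr/td/table[2]/tbody/tr/td[1]/form/table/tbody/tr/td/table/tbody/tr[11]/td").text
except Exception:
    # Placeholder; after the slicing below it becomes "not found"
    geneSeq = "file\nnot found"
# Drop the first line of the cell text
geneSeq = geneSeq[geneSeq.find('\n')+1:]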

I remove the first line of the input as I don't need it, but there are <br> tags within the HTML. They are registered in the output file, yet Python does not seem to see them: I have tried .isspace() and it returns False, and therefore .rsplit() does not work. Unfortunately the line breaks still show up when I try to write the sequence to a file using f.write.

Is there a way to remove the <br> tag?

You should probably consider a more full featured web-scraper such as [beautiful soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). — Tom Wyllie, Jul 18 '17 at 16:21
The problem with that is when I use BeautifulSoup and extract the html, the website queries have not been run so i am not actually seeing the sequences — CodingDuck, Jul 18 '17 at 16:58
Using an XPath that long is going to be brittle... you should probably spend some time reading some tutorials on CSS selectors and XPath so you can hand craft them. Your XPath can be replaced with `"//th/nobr[.='NT seq']/following::td"`. — JeffC, Jul 18 '17 at 18:33
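
As a rough sketch of how the shorter XPath from JeffC's comment might be used (not verified against the live page): the helper name get_nt_sequence is hypothetical, browser is assumed to be an already-open Selenium WebDriver, and find_element("xpath", ...) is used here as the generic form of the find_element_by_xpath call in the question.

```python
# XPath taken from the comment above: the first <td> following the "NT seq" header.
NT_SEQ_XPATH = "//th/nobr[.='NT seq']/following::td"


def get_nt_sequence(browser):
    """Return the NT seq cell text without its first line, or None if lookup fails.

    `browser` is assumed to be a Selenium WebDriver (hypothetical helper).
    """
    try:
        cell = browser.find_element("xpath", NT_SEQ_XPATH)
    except Exception:
        # Element missing or page not loaded; the caller decides what to do.
        return None
    text = cell.text
    # Drop everything up to and including the first newline, as in the question.
    return text[text.find("\n") + 1:]
```

Returning None instead of a placeholder string keeps a failed lookup from being written to the output file as if it were sequence data.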

score 1 · Answer 1 · answered Jul 18 '17 at 16:21

Assuming your HTML string is named html, do this:

html = html.replace('<br>', '')

answered Jul 18 '17 at 16:21 by Cory Madden

Sorry, I was not explicit enough - the <br> doesn't show up in the text, it just gives me phantom line breaks in my code – CodingDuck Jul 18 '17 at 16:26
Oh, I see. Sorry. It seemed like you didn't understand the functionality of the methods you were trying to use, but this makes more sense. Try the accepted answer here: https://stackoverflow.com/questions/3711856/how-to-remove-empty-lines-with-or-without-whitespace-in-python – Cory Madden Jul 18 '17 at 16:28
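
If the phantom line breaks are ordinary whitespace characters such as \n or \r (an assumption; the thread never establishes what they actually are), a minimal sketch of stripping them could look like this. The helper name strip_line_breaks and the sample string are made up for illustration.

```python
def strip_line_breaks(text):
    """Remove every run of whitespace (spaces, tabs, newlines) from text."""
    # str.split() with no arguments splits on any whitespace run and
    # discards empty pieces, so joining the parts drops all of it.
    return "".join(text.split())


# Hypothetical sample shaped like the element text: a header line, then sequence lines.
sample = "1332 nt\natgaaacgca ttagcaccac\ncattaccacc\n"
body = sample[sample.find("\n") + 1:]  # drop the first line, as in the question
cleaned = strip_line_breaks(body)
```

Because this removes spaces as well as line breaks, it only suits data such as a nucleotide sequence where whitespace carries no meaning.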

score 0 · Answer 2 · answered Jul 18 '17 at 17:09

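This will print the whole HTML content of a page in Python:

try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2, as originally posted

req = Request('https://www.google.com')
response = urlopen(req)
the_page = response.read()  # raw HTML of the page
print(the_page)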

score 0 · Answer 3 · answered Jul 18 '17 at 17:13

Thank you for all the answers. Because Python was not seeing the space as whitespace, I ended up writing a loop that keeps only alphabetic characters, which seemed to work:

noSpace = ""
for char in geneSeq:
    # Keep letters only; line breaks and any other non-letters are dropped
    if char.isalpha():
        noSpace = noSpace + char
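
The same filter can be written more compactly with str.join. This is only a sketch of an equivalent form, and letters_only is a hypothetical name, not something from the thread.

```python
def letters_only(text):
    """Keep only alphabetic characters, mirroring the loop above."""
    # Building the result with join avoids repeated string concatenation.
    return "".join(ch for ch in text if ch.isalpha())
```

Like the loop, this also discards digits and punctuation, which is fine for a plain nucleotide sequence but would lose information in text where those characters matter.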
