I have lines like these:

[b'\t\t\t\t\t\t\t\t<td><a href=info.php?id=Cre35.g759247.t1.1+ target="_blank">Cre35.g759247.t1.1 </a></td>']
[b'\t\t\t\t\t\t\t\t<td><a href=info.php?id=Cre17.g739850.t1.2 target="_blank">Cre17.g739850.t1.2</a></td>']
[b'\t\t\t\t\t\t\t\t<td><a href=info.php?id=Cre17.g737200.t1.2 target="_blank">Cre17.g737200.t1.2</a></td>']

I'm after the identifier that starts with "Cre" and ends with ".t", a number, ".", and another number (for example, Cre17.g739850.t1.2).

How exactly can I extract it?

Matthew

3 Answers

This regex should do the trick:

Cre.*?\.t\d\.\d

It matches the literal Cre, then any characters (as few as possible), then a literal ., the letter t, a digit, another literal ., and a final digit.
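
As a rough illustration of how that pattern could be applied in Python, here is a minimal sketch. It assumes each row is a one-element list of bytes, as printed in the question; the names `rows`, `ID_PATTERN`, and `extract_ids` are made up for this example.

```python
import re

# Sample rows copied from the question (leading tabs shortened).
rows = [
    [b'\t\t<td><a href=info.php?id=Cre35.g759247.t1.1+ target="_blank">Cre35.g759247.t1.1 </a></td>'],
    [b'\t\t<td><a href=info.php?id=Cre17.g739850.t1.2 target="_blank">Cre17.g739850.t1.2</a></td>'],
    [b'\t\t<td><a href=info.php?id=Cre17.g737200.t1.2 target="_blank">Cre17.g737200.t1.2</a></td>'],
]

# The pattern from the answer: "Cre", a lazy run of any characters,
# then ".t", one digit, ".", one digit.
ID_PATTERN = re.compile(r"Cre.*?\.t\d\.\d")


def extract_ids(rows):
    """Return the first matching identifier found in each row."""
    ids = []
    for row in rows:
        text = row[0].decode("utf-8")  # assumption: the bytes are UTF-8 text
        match = ID_PATTERN.search(text)
        if match:
            ids.append(match.group(0))
    return ids


print(extract_ids(rows))
```

Each row contains the identifier twice (once in the href, once as link text), so `search` is used to take only the first occurrence. If version numbers can have more than one digit, `\d+` in place of `\d` would be the natural adjustment.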


Nick Reed
from bs4 import BeautifulSoup

html = '''[b'\t\t\t\t\t\t\t\t<td><a href=info.php?id=Cre35.g759247.t1.1+ target="_blank">Cre35.g759247.t1.1 </a></td>']
[b'\t\t\t\t\t\t\t\t<td><a href=info.php?id=Cre17.g739850.t1.2 target="_blank">Cre17.g739850.t1.2</a></td>']
[b'\t\t\t\t\t\t\t\t<td><a href=info.php?id=Cre17.g737200.t1.2 target="_blank">Cre17.g737200.t1.2</a></td>']'''

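# Parse the HTML; naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(html, 'html.parser')

# Show the parsed document, indented for readability.
print(soup.prettify())

# Each identifier is the text of an <a> element;
# strip() drops the trailing space in the first row.
for link in soup.find_all('a'):
    print(link.get_text().strip())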

Jee Mok

It looks like you don't need regex at all: you can use a CSS attribute selector with the contains (*=) operator, which matches elements whose href contains a given substring.

from bs4 import BeautifulSoup

html = '''[b'\t\t\t\t\t\t\t\t<td><a href=info.php?id=Cre35.g759247.t1.1+ target="_blank">Cre35.g759247.t1.1 </a></td>']
[b'\t\t\t\t\t\t\t\t<td><a href=info.php?id=Cre17.g739850.t1.2 target="_blank">Cre17.g739850.t1.2</a></td>']
[b'\t\t\t\t\t\t\t\t<td><a href=info.php?id=Cre17.g737200.t1.2 target="_blank">Cre17.g737200.t1.2</a></td>']'''

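# The import above binds the name BeautifulSoup, so call it by that name.
soup = BeautifulSoup(html, 'html.parser')
# Select every element whose href contains "php?id=Cre" and collect its text.
items = [i.text.strip() for i in soup.select("[href*='php?id=Cre']")]
print(items)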
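
To unpack the selector used above, the following sketch compares three CSS attribute operators: `*=` (contains), `^=` (starts with), and `$=` (ends with). The markup is a simplified, hypothetical sample with quoted href values, not the real page, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the rows in the question.
sample = (
    '<td><a href="info.php?id=Cre17.g739850.t1.2">Cre17.g739850.t1.2</a></td>'
    '<td><a href="info.php?id=Cre35.g759247.t1.1">Cre35.g759247.t1.1</a></td>'
    '<td><a href="other.php?id=42">unrelated</a></td>'
)
soup = BeautifulSoup(sample, "html.parser")

# [href*='...'] : href contains the substring anywhere.
contains = [a.get_text(strip=True) for a in soup.select("a[href*='php?id=Cre']")]

# [href^='...'] : href starts with the substring.
starts_with = [a.get_text(strip=True) for a in soup.select("a[href^='info.php']")]

# [href$='...'] : href ends with the substring.
ends_with = [a.get_text(strip=True) for a in soup.select("a[href$='.t1.2']")]

print(contains)
print(starts_with)
print(ends_with)
```

The part before the brackets (`a`) restricts the match to anchor elements; leaving it out, as in the answer, matches any element carrying a qualifying href.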
QHarr
  • Thanks, how can I retrieve the fasta sequence and the length of sequence from this page in a similar fashion: https://phytozome.jgi.doe.gov/phytomine/portal.do?class=Protein&externalids=Cre06.g250300.t1.2 – ahmadkhalifa Sep 11 '19 at 04:23
  • thx, the website is currently acting up. The problem is that I can see the fasta and sequence-length information on the website, but when I run Beautiful Soup and ctrl+f through the output, I can't find it, hence no coding attempt – ahmadkhalifa Sep 11 '19 at 04:49
  • one moment whilst I look for cached version – QHarr Sep 11 '19 at 04:50
  • you can find many of these entries here: http://chlamyfp.org/readcsvfile_js.php. Thx to your code, I can now retrieve all of them, but what I ultimately want is their sequence and length. Also, I wonder if I can "select" certain lines that come after my "[href*='php?id=Cre']" selection? And what is the anatomy of this selector? I want to be able to select other things in a similar way – ahmadkhalifa Sep 11 '19 at 04:56
  • what do you mean by sequence and length? – QHarr Sep 11 '19 at 05:06
  • it's working now! I mean the "length", which is 287, and the "fasta", which gives a string of letters of that length: https://phytozome.jgi.doe.gov/phytomine/portal.do?class=Protein&externalids=Cre12.g511750.t1.2 – ahmadkhalifa Sep 11 '19 at 05:09
  • I did shift+ctrl+p and disabled java, anyway the line I'm after is: how can I pull the length from that line? Also, from line – ahmadkhalifa Sep 11 '19 at 06:01
  • thanks, you're really good! But what does select_one() do exactly? – ahmadkhalifa Sep 11 '19 at 06:28
  • it applies the CSS selector within the quotes to the HTML and returns the first match. It is also faster. https://stackoverflow.com/questions/39033612/bs4-select-one-vs-find , https://www.crummy.com/software/BeautifulSoup/bs4/doc/# , https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors – QHarr Sep 11 '19 at 06:31
  • ok great, I will definitely try to learn that stuff, but for now what about retrieving the number for fasta or fasta sequence directly? – ahmadkhalifa Sep 11 '19 at 06:40
  • What do you mean? The code extracts both of those for you. For the sequence I think you have to mimic the button push (XHR request) as shown in the pastebin. – QHarr Sep 11 '19 at 06:55
  • oh, but sometimes I get length = soup.select_one('.value').text.strip() AttributeError: 'NoneType' object has no attribute 'text'. I also realized the website has an API, but I couldn't find the Python script they mention here: https://phytozome.jgi.doe.gov/phytomine/api.do?subtab=python – ahmadkhalifa Sep 11 '19 at 07:31
  • Test whether the returned variable is None before attempting to access .text; otherwise return N/A or some other value – QHarr Sep 11 '19 at 07:43
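
The None check suggested in the last comment could look roughly like the sketch below. The `.value` selector comes from the comment thread; the helper name `text_or_default` and the two tiny HTML snippets are hypothetical stand-ins for the real page.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, selector, default="N/A"):
    """Return the stripped text of the first match, or `default` if nothing matches."""
    node = soup.select_one(selector)  # select_one returns None when there is no match
    if node is None:
        return default
    return node.get_text(strip=True)


# Hypothetical pages: one with a .value element, one without.
page_with_value = BeautifulSoup('<span class="value"> 287 </span>', "html.parser")
page_without_value = BeautifulSoup("<p>no data</p>", "html.parser")

print(text_or_default(page_with_value, ".value"))
print(text_or_default(page_without_value, ".value"))
```

Wrapping the check in a helper keeps the scraping loop free of repeated `if ... is None` branches, at the cost of hiding which pages were actually missing the element.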