I'm trying to find species names (also called binomial names or Linnaean names) such as "Homo sapiens" using a regex. The rules should be: exactly two words; the first word starts with a capital letter and the second does not; both words contain only letters, with no digits, dashes or other characters. My naive implementation is:
binomal <- "([A-Z]{1}[a-z]{2,}[:space:][a-z]{2,})"
It does find such names, but it also returns matches (with R's grep function) where I don't expect them, for example in this line of text:
" Japan, China Sea, to Australia"
Any suggestions?
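To make the problem reproducible, here is a minimal sketch. It assumes base R's grepl with its default regex engine; the vector name lines and the second test string are only added here for comparison and are not from my real data.

```r
# Pattern exactly as written above (note the bare [:space:]).
binomal <- "([A-Z]{1}[a-z]{2,}[:space:][a-z]{2,})"

# The problematic line from above, plus a genuine binomial name for comparison.
lines <- c(" Japan, China Sea, to Australia", "Homo sapiens")

# One logical value per element: TRUE where the pattern matches somewhere.
hits <- grepl(binomal, lines)
print(hits)
```

The first element is the one that comes back as a match even though it contains no binomial name.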
Edit: thanks for your suggestions so far. I should clarify two things. First, each word should have at least two characters (i.e. "A b" shouldn't be captured). Second, I'm actually trying to use this to find such binomial names in an HTML file, so JvdV's misgivings about anchors are unfortunately justified. Here is a short excerpt of my HTML file:
<tr>
<td height="60"> </td>
<td colspan="3"><div align="center"><em>Anadara grandis</em> (Broderip & Sowerby, 1829)<br />
B_ARCI_012 W. Mexico 125mm</div></td>
Here I am trying to capture "Anadara grandis".
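To make the goal concrete, here is a rough sketch of the kind of extraction I am after, not a definitive solution. It assumes the excerpt is held in a character vector (html is a made-up name), that the bracketed class `[[:space:]]` is the intended way to match whitespace, and that word boundaries (`\\b`) with `perl = TRUE` are acceptable in place of anchors; the pattern name candidate is also hypothetical.

```r
# The HTML excerpt from above, one element per line.
html <- c(
  '<tr>',
  '<td height="60"> </td>',
  '<td colspan="3"><div align="center"><em>Anadara grandis</em> (Broderip & Sowerby, 1829)<br />',
  'B_ARCI_012 W. Mexico 125mm</div></td>'
)

# Assumed pattern: a capitalised word of at least two letters, one whitespace
# character, then a lowercase word of at least two letters. The \\b boundaries
# keep a match from starting or ending in the middle of a longer word.
candidate <- "\\b[A-Z][a-z]+[[:space:]][a-z]{2,}\\b"

# gregexpr locates every match per line; regmatches extracts the matched text.
found <- unlist(regmatches(html, gregexpr(candidate, html, perl = TRUE)))
print(found)
```

On this excerpt the only string that satisfies the pattern is "Anadara grandis". In running prose it would still pick up ordinary capitalised word pairs such as the start of a sentence, so restricting the search to the contents of the em tags may be needed as well.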