61

The DOI system places basically no useful limitations on what constitutes a reasonable identifier. However, being able to pull DOIs out of PDFs, web pages, etc. is quite useful for citation information, etc.

Is there a reliable way to identify a DOI in a block of text without assuming the 'doi:' prefix? (any language acceptable, regexes preferred, and avoiding false positives a must)

Kai

7 Answers

65

OK, I'm currently extracting thousands of DOIs from free-form text (XML), and I realized that my previous approach had a few problems, namely with encoded entities and trailing punctuation. So I went on to read the specification, and this is the best I could come up with.


The DOI prefix shall be composed of a directory indicator followed by a registrant code. These two components shall be separated by a full stop (period).

The directory indicator shall be "10". The directory indicator distinguishes the entire set of character strings (prefix and suffix) as digital object identifiers within the resolution system.

Easy enough, the initial \b prevents us from "matching" a "DOI" that doesn't start with 10.:

$pattern = '\b(10[.]';

The second element of the DOI prefix shall be the registrant code. The registrant code is a unique string assigned to a registrant.

Also, all assigned registrant codes are numeric and at least 4 digits long, so:

$pattern = '\b(10[.][0-9]{4,}';

The registrant code may be further divided into sub-elements for administrative convenience if desired. Each sub-element of the registrant code shall be preceded by a full stop.

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*';


The DOI syntax shall be made up of a DOI prefix and a DOI suffix separated by a forward slash.

However, this isn't absolutely necessary: section 2.2.3 states that uncommon suffix systems may use other conventions (such as 10.1000.123456 instead of 10.1000/123456), but let's cut ourselves some slack.

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/';


The DOI name is case-insensitive and can incorporate any printable characters from the legal graphic characters of Unicode. The DOI suffix shall consist of a character string of any length chosen by the registrant. Each suffix shall be unique to the prefix element that precedes it. The unique suffix can be a sequential number, or it might incorporate an identifier generated from or based on another system.

Now this is where it gets trickier. In all the DOIs I have processed, I saw the following characters (besides [0-9a-zA-Z], of course) in their suffixes: `.-()/:-`. So, while it doesn't exist, the DOI 10.1016.12.31/nature.S0735-1097(98)2000/12/31/34:7-7 is completely plausible.

The logical choice would be to use \S or the [[:graph:]] PCRE POSIX class, so let's do that:

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/\S+'; // or

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/[[:graph:]]+';


Now we have a difficult problem: the [[:graph:]] class is a superset of the [[:punct:]] class, which includes characters easily found in free text or any markup language: "'&<> among others.

Let's just filter out the markup ones for now using a negative lookahead:

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+'; // or

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])[[:graph:]])+';


The above should cover encoded entities (&), attribute quotes (["']) and open / close tags ([<>]).

Unlike markup languages, free text usually doesn't employ punctuation characters unless they are bounded by at least one space or placed at the end of a sentence, for instance:

This is a long DOI: 10.1016.12.31/nature.S0735-1097(98)2000/12/31/34:7-7!!!

The solution here is to close our capture group and assert another word boundary:

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+)\b'; // or

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])[[:graph:]])+)\b';

And voilà, here is a demo.
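As a rough illustration of how the final pattern behaves, here is a minimal Python sketch. It uses the \S variant, because Python's `re` module does not support the `[[:graph:]]` POSIX class. The names `DOI_PATTERN` and `find_dois` are made up for this example and are not part of the answer.

```python
import re

# Final pattern from the answer (\S variant), written as a raw string.
DOI_PATTERN = re.compile(
    r'''\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&'<>])\S)+)\b'''
)


def find_dois(text):
    """Return every DOI-like substring found in text, in order of appearance."""
    # With a single capture group, findall returns the captured strings.
    return DOI_PATTERN.findall(text)


sample = 'This is a long DOI: 10.1016.12.31/nature.S0735-1097(98)2000/12/31/34:7-7!!!'
print(find_dois(sample))
```

The trailing `\b` is what drops the `!!!` from the sample sentence. As the comments below point out, a DOI containing `<` or `>` is cut off at the first such character, and registrant codes shorter than four digits are not matched.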

Terence Eden
Alix Axel
  • 4
    It is not the case (or no longer the case) that all assigned registrant codes are at least four digits long. For example, 10.231 is the Journal of Investigative Medicine. E.g., 10.231/JIM.0b013e31820bab4c – David Conrad Aug 22 '12 at 18:04
  • 4
    Wiley uses "<" and ">" in their DOIs. For instance, 10.1002/(SICI)1522-2594(199911)42:5<952::AID-MRM16>3.0.CO;2-S is a valid DOI. This DOI is not captured by the above regex. A quick fix is to remove open/close tags from the set of non-DOI-characters. (See https://sourceforge.net/p/jabref/patches/203/) – koppor May 25 '13 at 22:06
  • 2
    10.1002/(SICI)1522-2594(199911)42:5<952::AID-MRM16>3.0.CO;2-S is a valid DOI, so the above regexp should be modified to something like `$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'])\S)+)\b';` – jpowell May 25 '13 at 22:07
  • 1
    @koppor, did you miss David Conrad's comment above that the registrant code can now be 3 digits? (It looks that way in your patch on JabRef on SourceForge.) – Leo Dec 12 '14 at 00:31
  • 1
    There is another problem here: HTML escaping and URL encoding. Does anyone have a regex that takes care of that too? – Leo Dec 12 '14 at 00:37
  • 1
    @AlixAxel Thanks for your post. There are several patterns in your answer, each with an explanation; could you please put the final pattern or patterns that you are suggesting for a DOI in a conclusion section in your post? Thanks again – epsi1on Nov 02 '15 at 07:18
  • I have this question that finds many DOI patterns https://stackoverflow.com/questions/43683957/whats-the-correct-format-of-java-string-regex-to-identify-doi, apart from this format of DOI "10.1175/1520-0485(2002)032<0870:CT>2.0.CO;2" does anyone have a REGEX that can MATCH this format? – Hector Jul 14 '17 at 11:12
  • 1
    @DavidConrad - Where did you get that JIM DOI from? The DOI resolver at https://www.doi.org fails to resolve it. – Robert Knight Sep 07 '17 at 09:27
  • 1
    @RobertKnight Hmm, strange. I think I got it from a ProQuest search, and if you Google "10.231/JIM" you can find lots of others, but they also don't resolve, now. It looks like JIM is using 2310 now and if you change it to 10.2310/JIM.etc. it does resolve. Weird. – David Conrad Sep 08 '17 at 08:04
  • Most of those regexes are missing a `)` at the end. You open 2 brackets, but only close one. – blkpingu Oct 04 '19 at 15:49
  • 1
    The word boundary / `\b` might be problematic in a few edge-cases, because a [trailing period/dot/`.` at the end of a DOI can actually be part of it](https://twitter.com/adam42smith/status/1283113869758660610). – Katrin Leinweber Aug 23 '20 at 14:43
  • Would anyone be able to provide an example of successful usage in bash (or, any other language)? I've tried this in awk with so many variations and I can't get it to work. Works perfectly on regxr.com. – kubu4 May 03 '21 at 19:16
22

Crossref has a recommendation that they tested successfully on 99.3% of the DOIs known to them:

/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i
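A minimal Python sketch of using this recommendation to validate a single candidate string (the `^` and `$` anchors make it a whole-string check rather than a search). One deliberate change, which is mine and not Crossref's: the dot after `10` is escaped, since an unescaped dot matches any character. The names `CROSSREF_DOI` and `looks_like_crossref_doi` are hypothetical.

```python
import re

# Crossref's recommended pattern, with the dot after "10" escaped (my change).
# re.IGNORECASE stands in for the trailing /i flag.
CROSSREF_DOI = re.compile(r'^10\.\d{4,9}/[-._;()/:A-Z0-9]+$', re.IGNORECASE)


def looks_like_crossref_doi(candidate):
    """Return True if the whole candidate string matches the pattern."""
    return CROSSREF_DOI.match(candidate) is not None


print(looks_like_crossref_doi('10.1016/S0735-1097(98)00347-7'))
print(looks_like_crossref_doi('10.1002/(SICI)1522-2594(199911)42:5<952::AID-MRM16>3.0.CO;2-S'))
```

The second call is there to show the limits: the older Wiley-style DOI mentioned elsewhere on this page contains `<` and `>`, which are outside the character class, so it is rejected.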
Katrin Leinweber
  • 3
    But also keep in mind that: _Crossref is not the only DOI Registration Agency and while our members account for 65-75% of all registered DOIs this means there are tens of millions of DOIs that we have not seen._ – Carlos Vega Oct 10 '19 at 09:53
14

@Silas The sanity checking is a good idea. However, the regex doesn't cover all DOIs. The first element must (currently) be 10, and the second element must (currently) be numeric, but the third element is barely restricted at all:

"Legal characters are the legal graphic characters of Unicode. This specifically excludes the control character ranges 0x00-0x1F and 0x80-0x9F..."

and that's where the real problem lies. In practice, I've never seen whitespace used, but the spec specifically allows for it. Basically, there doesn't seem to be a sensible way of detecting the end of a DOI.

Kai
4

I'm sure it's not super-helpful for the OP at this point, but I figured I'd post what I am trying in case anyone else like me stumbles upon this:

(10.(\d)+/(\S)+)

This matches: "10 dot number slash anything-not-whitespace"

But for my use (scraping HTML), this was finding false-positives, so I had to match the above, plus get rid of quotes and greater-than/less-than:

(10.(\d)+/([^(\s\>\"\<)])+)

I'm still testing these out, but I'm feeling hopeful thus far.

rgcb
  • While this regex probably works for all existing DOI names the specification says: _"All prefixes so far issued have been simple numeric strings, but there is nothing to prevent alphabetical characters being used. The prefix may be further divided into sub-prefixes, for example: 10.1000.10/123456"_ – Tor-Erik Oct 21 '11 at 15:21
  • The cited text of Ju9OR may be found at [the DOI manual: The structure of a DOI name](http://www.doi.org/handbook_2000/enumeration.html#2.2) – koppor Feb 01 '12 at 14:40
  • 2
    Based on the comments, the regexp `(10\.[^/]+/([^(\s\>\"\<})])+)` works for me (especially in BibTeX files) – koppor Feb 01 '12 at 14:46
  • @koppor: The DOI `10.1016/S0735-1097(98)00347-7` is valid. – Alix Axel Apr 24 '12 at 14:04
  • 5
    -1, This answer is quite bad to be honest, not even the dot meta character is being escaped. – Alix Axel Apr 24 '12 at 15:32
3

Here is my go at it:

(10[.][0-9]{4,}[^\s"/<>]*/[^\s"<>]+)

And a couple of valid edge cases where this doesn't fail, but others seem to:

It also correctly discards some false-positive (X|HT)ML content, such as:

  • <geo coords="10.4515260,51.1656910"></geo>
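A quick way to check this behaviour is a small Python sketch like the following. `LOOSE_DOI` and `extract_dois` are names invented for the example; the DOI used is the one cited as valid in a comment on another answer.

```python
import re

# The pattern from this answer, unchanged.
LOOSE_DOI = re.compile(r'(10[.][0-9]{4,}[^\s"/<>]*/[^\s"<>]+)')


def extract_dois(text):
    """Return all substrings of text that the pattern accepts."""
    return LOOSE_DOI.findall(text)


# The coordinate pair is rejected because no "/" follows before the quote.
print(extract_dois('<geo coords="10.4515260,51.1656910"></geo>'))
print(extract_dois('cited as 10.1016/S0735-1097(98)00347-7 in the text'))
```

Note that this pattern has no trailing boundary, so punctuation that directly follows a DOI at the end of a sentence is captured along with it.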
Community
Alix Axel
2

This is a really old and answered question, but here's another potential substitute.

\b10\.(\d+\.*)+[\/](([^\s\.])+\.*)+\b

This assumes that white space is not part of the DOI.

Haven't tested this for false positives, but it seems to be able to find all the edge cases mentioned in this page.

Terence Eden
hobwell
2

The following regex should do the job (Perl regex syntax):

/(10\.\d+\/\d+)/

You could do some additional sanity checking by opening the URLs

http://hdl.handle.net/<doi>

and

http://dx.doi.org/<doi>

where <doi> is the candidate DOI,

and testing that you a) get a 200 OK HTTP status, and b) the returned page is not the "DOI not found" page for the service.
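A minimal Python sketch of check (a), using only the standard library. It assumes `https://doi.org/` as the resolver, following the recommendation in the comments rather than the two URLs above; `RESOLVER`, `doi_url`, and `resolves` are hypothetical names. Check (b), inspecting the returned page, is not implemented here.

```python
import urllib.parse
import urllib.request

RESOLVER = "https://doi.org/"  # assumption: swap in another resolver if needed


def doi_url(doi):
    """Build a resolver URL, percent-encoding characters such as < and >."""
    return RESOLVER + urllib.parse.quote(doi, safe="/")


def resolves(doi, timeout=10):
    """Return True if the resolver, after following redirects, answers 200 OK."""
    request = urllib.request.Request(doi_url(doi), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers HTTP errors (e.g. 404), connection failures, and timeouts.
        return False


print(doi_url("10.1038/ejcn.2010.73"))
```

Treat a `False` result with some caution: some publisher sites reject HEAD requests or automated clients, so a failure here does not prove that the DOI is invalid.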

Silas Snider
  • 4
    This regex does not match all DOIs (particularly the ones which contain letters or dots after the slash), such as http://dx.doi.org/10.1038/ejcn.2010.73 ) – Romain Guidoux Dec 10 '11 at 11:56
  • But the "sanity checking" advice is a good one! By now however, [`https://doi.org/` should be used](https://www.doi.org/doi_handbook/3_Resolution.html#3.8) to resolve DOIs :-) – Katrin Leinweber Jan 30 '18 at 14:44