
I have a long list of citations for which I need to extract each author's full name, year published, title, etc. One of the citations looks like this:

Joe Bob, Jane Doe and George H. Smith (2017). A title of an interesting report: Part 2. Report Series no. 101, Place for Generating Reports, Department of Report Makers, City, Province, Country, 44 pages. ISBN: (print) 123-0-1234-1234-5; (online) 123-0-1234-1234-5.

All of the citations are formatted the same way. The part I am stuck on right now is extracting the authors' full names. I read here about how to extract values from a comma-, space-, or semicolon-separated list by doing something like [\\s,;]+. How would I do something similar for a comma or the word 'and'?

I assume that 'and' needs to be treated like a group of characters so I tried [^,|[and])]+ to match the spaces between either , or the character set [and] but this doesn't seem to work. This question is similar in that it deals with a comma or a space, but the solution involves the spaces being stripped implicitly.

After getting this portion down I plan on building the rest of the expression to capture the other citation details. So assume that the string we are dealing with is simply:

Joe Bob, Jane Doe and George H. Smith

and each full name should be captured.

pbreach
  • I am not sure you can oversimplify the input like that. You may try [`,\s*|\s+and\s+`](https://regex101.com/r/aIzAky/2) or [`(?:,\s*|\s+and\s+)+`](https://regex101.com/r/aIzAky/1) but it may not be useful in the end. Just FYI: `[and]` matches a single char, `a`, `n` or `d`. To match a sequence of chars, you need to write them outside a character class. – Wiktor Stribiżew Oct 11 '17 at 16:53
  • I think trying to design a single regex is going to be needlessly complicated. I'd first split the string into smaller pieces and then deal with each of those individually. With this approach, then you *could* only deal with your simplified input at the end. – Jared Goguen Oct 11 '17 at 16:56
  • @JaredGoguen You may be right. What you mentioned was the approach I started with, but it looked messy so I thought of using a regex. The other details of the citation don't seem too hard to capture, so I thought I could string these together. I guess I'll keep going with the splitting approach for now. – pbreach Oct 11 '17 at 17:01
  • @pbreach I was thinking of combining the two, so using a regex to split the initial citation into parts, and then using individual regexes to process each of those parts. – Jared Goguen Oct 11 '17 at 17:03
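
The two suggestions in the comments can be combined roughly as follows: isolate the author portion first, then split it on a comma or the whole word `and` using an alternation rather than a character class. This is only a minimal sketch, assuming the author list always ends just before the first ` (` that opens the year; the helper name `split_authors` and the constant `AUTHOR_SEP` are hypothetical.

```python
import re

# Alternation, not a character class: either a comma (plus optional
# whitespace) or the whole word "and" with whitespace on both sides.
AUTHOR_SEP = re.compile(r",\s*|\s+and\s+")


def split_authors(authors):
    """Split an author string such as 'A, B and C' into individual names."""
    return [name for name in AUTHOR_SEP.split(authors.strip()) if name]


citation = (
    "Joe Bob, Jane Doe and George H. Smith (2017). "
    "A title of an interesting report: Part 2."
)

# Step 1: isolate the author portion (assumed to precede the first " (").
authors_part = citation.split(" (", 1)[0]

# Step 2: split that portion into names.
print(split_authors(authors_part))
```

Because `\s+and\s+` requires whitespace on both sides, a surname that merely contains the letters "and" (for example "Anderson") is not split.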

1 Answer

Here is one possible approach:

citation = """Joe Bob, Jane Doe and George H. Smith (2017). A title of an interesting report: Part 2. Report Series no. 101, Place for Generating Reports, Department of Report Makers, City, Province, Country, 44 pages. ISBN: (print) 123-0-1234-1234-5; (online) 123-0-1234-1234-5."""

citation = citation.replace(' and ', ',')
citation = citation[:citation.find('(')]

names = [name.strip() for name in citation.split(',')]

print(names)

Giving you:

['Joe Bob', 'Jane Doe', 'George H. Smith']

This converts ' and ' into a comma, slices the string up to where the year starts, and splits the result on commas.

Or in a more compact form:

names = [name.strip() for name in citation[:citation.find('(')].replace(' and ', ',').split(',')]
Martin Evans
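
One caveat with the slicing approach: `str.find` returns -1 when no `(` is present, so `citation[:citation.find('(')]` would silently drop the last character rather than fail. A possible variant, shown only as a sketch, uses `str.partition`, which returns the whole string as the head when the separator is missing. The function name `extract_authors` is hypothetical.

```python
def extract_authors(citation):
    """Return the author names that precede the first '(' in a citation."""
    # partition() returns (head, separator, tail); if '(' is absent,
    # head is the entire string and nothing is cut off.
    head, _sep, _tail = citation.partition('(')
    parts = head.replace(' and ', ',').split(',')
    # Strip whitespace and skip any empty pieces.
    return [name.strip() for name in parts if name.strip()]


print(extract_authors("Joe Bob, Jane Doe and George H. Smith (2017). A title."))
print(extract_authors("Joe Bob and Jane Doe"))
```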