A user input string can come in one of two formats, each with minor variations:

Some AB, Author C, Names DEF,(2018) The title string. T journal name, 10, 560–564
Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564
Some AB, Author C, Names DEF et al (2018) The title string? T journal name 10:560-564
Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564
Some AB, Author C, Names DEF. The title string. T journal name 2018;10:560-564

What I need to get is:

  1. Author string part: Some AB, Author C, Names DEF or Some AB, Author C, Names DEF et al
  2. Article title string: The title string or The title string?
  3. Journal name string: T journal name
  4. Year value: 2018
  5. Edition value: 10
  6. Page numbers 560-564

So I have to split the string on the delimiters . or (1234), as well as ; and :.

I can't come up with a working regex for that, and I don't know how to handle both formats, which have the year at different positions.

I started with something like:

string.split(/^\(\d+\)\s*/)

But how do I proceed from there, given that this returns an array?

user3142695

3 Answers

I would also suggest going with a match pattern:

^([^.(]+)(?:\((\d{4})\)|\.)\s*([^?!.]*.)\s*([^0-9,]+)(\d{4})?[,; ]*([^,: ]*)[,;: ]*(\d+(?:[–-]\d+)?)

Or a more readable version with named capture groups*:

^(?<author>[^.(]+)(?:\((?<yearf1>\d{4})\)|\.)\s*(?<title>[^?!.]*.)\s*(?<journal>[^0-9,]+)(?<yearf2>\d{4})?[,; ]*(?<issue>[^,: ]*)[,;: ]*(?<pages>\d+(?:[–-]\d+)?)

I build on R. Schifini's approach of using negated character classes to find the required pieces.
To distinguish between the two formats, I've added two optional capture groups for the year (one for format 1, one for format 2) and wrapped the remaining pieces in capture groups of their own. The only thing left is to check whether group 2 or group 5 holds the year.

Code sample:

const regex = /^([^.(]+)(?:\((\d{4})\)|\.)\s*([^?!.]*.)\s*([^0-9,]+)(\d{4})?[,; ]*([^,: ]*)[,;: ]*(\d+(?:[–-]\d+)?)/gm;
const str = `Some AB, Author C, Names DEF,(2018) The title string. T journal name, 10, 560–564
Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564
Some AB, Author C, Names DEF et al (2018) The title string? T journal name 10:560-564
Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564
Some AB, Author C, Names DEF. The title string. T journal name 2018;10:560-564`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // Declare the result object locally instead of creating an implicit global
    const array = {};
    m.forEach((match, groupIndex) => {
        switch(groupIndex) {
        case 0:
            console.log(`Full match: ${match}`);
            break;
        case 1:
            array['author'] = match.trim();
            break;
        case 2:
            if(match)
                array['year'] = match;
            break;
        case 3:
            array['title'] = match.trim();
            break;
        case 4:
            array['journal'] = match.trim();
            break;
        case 5:
            if(match)
                array['year'] = match.trim();
            break;
        case 6:
            array['issue'] = match.trim();
            break;
        case 7:
            array['pages'] = match.trim();
            break;        
        default:
            console.log(`Unknown match, group ${groupIndex}: ${match}`);
        }
    });
    console.log(JSON.stringify(array));
}

*Named capture groups in JavaScript (added in ES2018) are not supported in older browsers. Either remove the group names or use Steve Levithan's XRegExp library, which works around this.
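Where named groups are available, the year check can be reduced to picking whichever of the two year groups matched. The following is only a minimal sketch under that assumption; `citationPattern` and `parseCitation` are hypothetical names chosen for illustration, and the pattern is the named-group version from above, unchanged.

```javascript
// Named-group version of the pattern above (needs an engine with ES2018 support).
const citationPattern = /^(?<author>[^.(]+)(?:\((?<yearf1>\d{4})\)|\.)\s*(?<title>[^?!.]*.)\s*(?<journal>[^0-9,]+)(?<yearf2>\d{4})?[,; ]*(?<issue>[^,: ]*)[,;: ]*(?<pages>\d+(?:[–-]\d+)?)/;

// parseCitation is a hypothetical helper, not part of any library.
function parseCitation(line) {
  const m = citationPattern.exec(line);
  if (m === null) {
    return null; // the line does not follow either format
  }
  const g = m.groups;
  return {
    author: g.author.trim(),
    // Exactly one of the two year groups is expected to be set.
    year: g.yearf1 || g.yearf2,
    title: g.title.trim(),
    journal: g.journal.trim(),
    issue: g.issue.trim(),
    pages: g.pages,
  };
}

console.log(parseCitation("Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564"));
console.log(parseCitation("Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564"));
```

Note that for the first sample line of the question (`Names DEF,(2018)`) the author group keeps the trailing comma, so some extra cleanup may still be needed there.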

wp78de

Since you don't have a specific separator you have to extract the parts you need, in most cases, piece by piece.

For these examples you could get Authors, Article name and Journal with:

str.match(/^([^.(]*)[^ ]*([^?.]*.)([^0-9,]*)/)
  • ^([^.(]*) captures everything from the start until it finds a ( or .

  • [^ ]* skips possible year (2018) before the article.

  • ([^?.]*.) captures the Article Name

  • and ([^0-9,]*) captures the Journal Name

The match will return an array with four elements. The three captures are at index 1 to 3.
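As a quick illustration, the pattern can be applied to one of the sample strings from the question. This is only a sketch: the captures keep their surrounding whitespace, so trimming each element is an added assumption, and the variable names are made up for the example.

```javascript
const sample = "Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564";

// Index 0 is the full match; indexes 1 to 3 are the three captures.
const parts = sample.match(/^([^.(]*)[^ ]*([^?.]*.)([^0-9,]*)/);

// Trim every element, then skip the full match when destructuring.
const [, authors, title, journal] = parts.map((s) => s.trim());

console.log(authors);
console.log(title);
console.log(journal);
```

For input that may not match at all, `match` returns `null`, so a real implementation would check for that before destructuring.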

See Regex101.

The numeric parts are doable too: try a separate regexp to capture them. The year may be tricky, since a four-digit number could also be a page number.
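One possible way to do that, sketched here under the assumption that the year always appears either as (2018) or as 2018; (as in the question's samples), and that issue and pages are the last two numeric fields of the line. `extractNumbers` is a hypothetical helper name used only for illustration.

```javascript
// extractNumbers is a hypothetical helper, not part of any library.
function extractNumbers(line) {
  // Year: "(2018)" in the first format, "2018;" in the second.
  // Requiring the parentheses or the semicolon keeps a four-digit
  // page number from being mistaken for the year.
  const yearMatch = line.match(/\((\d{4})\)|(\d{4})\s*;/);

  // Issue and pages: anchored at the end of the line,
  // e.g. "10:560-564" or "10, 560–564".
  const tailMatch = line.match(/(\d+)\s*[,:]\s*(\d+(?:[–-]\d+)?)\s*$/);

  return {
    year: yearMatch ? (yearMatch[1] || yearMatch[2]) : null,
    issue: tailMatch ? tailMatch[1] : null,
    pages: tailMatch ? tailMatch[2] : null,
  };
}

console.log(extractNumbers("Some AB, Author C, Names DEF,(2018) The title string. T journal name, 10, 560–564"));
console.log(extractNumbers("Some AB, Author C, Names DEF. The title string. T journal name 2018;10:560-564"));
```

A title that itself contains a parenthesised four-digit number would still confuse the year pattern, so this only covers input shaped like the samples.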

R. Schifini

Rather than trying to figure out a complex regex, which IMHO is not possible in this case, you can write a function to parse the strings. Based on your sample data, it could look something like this:

var str = [
  "Some AB, Author C, Names DEF,(2018) The title string. T journal name, 10, 560–564",
  "Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564",
  "Some AB, Author C, Names DEF et al (2018) The title string? T journal name 10:560-564",
  "Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564",
  "Some AB, Author C, Names DEF. The title string. T journal name 2018;10:560-564"
];

function parse(str) {
  var result = [];
  var tmp = "";
  for (var i = 0; i < str.length; i++) {
    var c = str.charAt(i);
   
    if(c === ",") {
      if(str.charAt(i + 1) === "(") {
          result.push(tmp.trim());
          i++;
          tmp = "";
          continue;
      }
      
      // ", " followed by a digit ends a field; /\d/ avoids isNaN treating " " or "" as numeric
      if((str.charAt(i + 1) === " ") && /\d/.test(str.charAt(i + 2))) {
        result.push(tmp.trim());
        i++;
        tmp = "";
        continue;
      }
    }
    
    if((c === ".") || (c === "?") || (c === ":")) {
     if(str.charAt(i + 1) === " ") {
          result.push(tmp.trim());
          i++;
          tmp = "";
          continue;
      }
    }    

    if((c === "(") || (c === ")") || (c === ";")  || (c === ":")) {
      result.push(tmp.trim());
      tmp = "";
      if(str.charAt(i + 1) === " ") {
       i++;
      }
      continue;
    }
    
    // A space followed by a digit ends a field; /\d/ avoids isNaN treating " " or "" as numeric
    if((c === " ") && /\d/.test(str.charAt(i + 1))){
      result.push(tmp.trim());
      tmp = "";
      continue;
    }
    
    tmp += c;
  }
  result.push(tmp.trim());
  
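  // In the second format the year lands at index 3; move it to index 1 so both formats share one layout
  if(!isNaN(result[3])) {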
   result = [result[0], result[3], result[1], result[2], result[4], result[5]];
  }
  
 return result;
}

for(var j = 0; j < str.length; j++) {
 console.info(parse(str[j]));
}
xxxmatko