I've searched and found a lot of questions about names, but none that solves the case I'm dealing with.

I'm trying to get the cast of movies, so I use jQuery to store the text in a variable when the description contains a cast. The problem is that the text also contains categories, which are separated by commas too and may be wrongly detected as the cast.

Therefore, if I reject any text that has a word with 3 or more capitalized letters, I can pick out the correct cast from the returned text.


How the cast is usually laid out inside the text:

John Smith, Mary Jane, Neo, Trinity, Morpheus, Mr. Anderson


Sometimes the last name is missing, which is what complicates things, because such names can be confused with categories. If I could reject words from a denied list instead, that might work better than rejecting capitalized letters, which are very common in the categories.

var castExists = $('span.post-bold:contains("Cast")');
var cast = "";

if (castExists.length) {
    cast = $("div.post-message").text();
    var reg = /^(?!\s)([a-z ,A-Z.'-]+)/gm;
    var getCast = reg.exec( cast );
    if (getCast !== null) {
        cast = getCast[0].toString().trim();
    }
    else {
        // No match: clear the result instead of keeping the whole post text.
        cast = '';
    }
}

Title: Movie Title

Production: Something

Year: 2021

Categories: Drama, HORROR, Sci Fi, TV Show, Action

Cast: John Smith, Mary Jane, Neo, Trinity, Morpheus, Mr. Anderson

While Title:, Year:, Cast:, etc. are inside span.post-bold tags, everything is inside the div.post-message.

For example:

<div class="post-message">
<span class="post-bold">Title</span>
: Movie Title
<span class="post-bold">Year</span>
: 2021
<span class="post-bold">Categories</span>
: Drama, HORROR, Sci Fi, TV Show, Action
<span class="post-bold">Cast</span>
: John Smith, Mary Jane, Neo, Trinity, Morpheus, Mr. Anderson
</div>

Since it depends on how a user created the post, the order of the fields may differ.


Here is the last regex I tried to write, which wasn't working:

([A-Z][a-z]{1,}( |, )([A-Z][a-z]{1,})?)+



Update:

I created a link on regex101.com with the examples, since the comments suggest I wasn't clear enough in the question. I think this gives people a better chance to help me. The lines with names should match; the ones with categories must not.

PS: In the link, I used the regular expression Mohammad suggested in the comments.

buddemat
Commentator
  • All that said, it's not clear to me? ... Are you only wanting to regex the `Cast` information, but not categories, or both? What does the 3 letters capitalized mean, like the word `HORROR` example? – Paul T. Sep 05 '21 at 01:17
  • I want only the Cast. Usually the categories contain a word that has 3 or more capitalized letters, so that would be a way to filter them out and detect that a line is not the Cast line. Most names have a first and last name; categories usually do not, but they can still have two words before a comma. – Commentator Sep 05 '21 at 01:22
  • Could you please try this ([A-Z]([a-z]|[A-Z])*[\.]*(\s)*([a-z]|[A-Z])*)[,] and let me know whether it works or not? – mohammad mobasher Sep 05 '21 at 04:53
  • Unfortunately, it didn't work, Mohammad. [I created a link in regex101, so you guys can test it](https://regex101.com/r/7gX4pQ/1/). Thanks for trying to help. – Commentator Sep 05 '21 at 05:39

2 Answers

I don't think using a regex that finds three consecutive capital letters to filter out the cast line is the way to go.

Firstly, based on your example, this will not always work, as your example has a line with categories that does not have a word with three consecutive capital letters.

Secondly, if you are looking at actual names, it is very possible that you will also wrongly filter out lines with names, if e.g. someone like Rodney L Jones III is among the cast (take a look at these interesting wrong assumptions that programmers make about names).
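
To make both failure modes concrete, here is a minimal sketch of the "three consecutive capitals" heuristic. The function name `looksLikeCategories` and the two sample lines are made up for illustration; they are not taken from the regex101 link.

```javascript
// Heuristic under discussion: treat a line as "categories" when it
// contains a run of three or more consecutive capital letters.
// (Hypothetical helper name, for illustration only.)
function looksLikeCategories(line) {
  return /[A-Z]{3,}/.test(line);
}

// Hypothetical sample lines.
const categoriesWithoutCaps = "Drama, Sci Fi, TV Show, Action";
const castWithSuffix = "Rodney L Jones III, Mary Jane";

// A real category line slips through because no word has three capitals in a row.
console.log(looksLikeCategories(categoriesWithoutCaps)); // false

// A real cast line is rejected because of the "III" suffix.
console.log(looksLikeCategories(castWithSuffix)); // true
```
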

Instead, you can just extract the cast line by first finding a span that contains Cast (using filter()) and then get the text of the next node (using nextSibling.nodeValue). I also used substring() to trim characters at the beginning (remove the colon and the spaces) and trim() to remove the newline from the end of it:

var getCast = $("div.post-message").contents().filter(function() {
  return $(this).text() === 'Cast';
})[0].nextSibling.nodeValue.substring(3).trim();

This gives you the cast line:

John Smith, Mary Jane, Neo, Trinity, Morpheus, Mr. Anderson

This you can then simply split at each comma (assuming that there are no names with commas within them):

var actors = getCast.split(', ');
for (var i = 0; i < actors.length; i++) {
  console.log(actors[i]);
}

Output:

John Smith
Mary Jane
Neo
Trinity
Morpheus
Mr. Anderson

Test it here.
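
If you only have the flattened text rather than the DOM, the same "look up by label, then split" idea can be sketched without jQuery. The helpers `escapeRegExp` and `extractList` are hypothetical names, and the sample string is only my assumption of what `$("div.post-message").text()` might return for the markup in the question; the real whitespace may differ.

```javascript
// Hypothetical helper: escape regex metacharacters in a label.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return the comma-separated values that follow
// "<label>:" in the text, or an empty array if the label is absent.
function extractList(text, label) {
  // Allow whitespace (including a newline) between the label and the colon,
  // then capture the rest of that line.
  const re = new RegExp("\\b" + escapeRegExp(label) + "\\s*:[ \\t]*([^\\n]*)");
  const match = re.exec(text);
  if (match === null) return [];
  return match[1]
    .trim()
    .split(/\s*,\s*/)
    .filter(function (s) { return s.length > 0; });
}

// Assumed shape of the flattened post text (field order does not matter).
const text = [
  "",
  "Title",
  ": Movie Title",
  "Categories",
  ": Drama, HORROR, Sci Fi, TV Show, Action",
  "Cast",
  ": John Smith, Mary Jane, Neo, Trinity, Morpheus, Mr. Anderson",
  ""
].join("\n");

console.log(extractList(text, "Cast"));
// [ 'John Smith', 'Mary Jane', 'Neo', 'Trinity', 'Morpheus', 'Mr. Anderson' ]
```

Because the lookup is keyed on the label, the categories line is never a candidate, whatever order the user wrote the fields in.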

buddemat

Here is my suggestion for your question: regex101

BTW, if the input is consistently formatted, try this:

var str = 'John Smith, Mary Jane, Neo, Trinity, Morpheus, Mr. Anderson'
console.log(str.split(', '))
Thai Do
  • This gave me an idea, but it's not the magical regular expression I was looking for. Maybe the only solution is to separate things and analyze them with "if" conditions. – Commentator Sep 05 '21 at 21:23