Intelligent pattern matching in string

Question

Let's say I have filenames which are formatted differently. I want to be able to extract certain parts of each filename the way a human would, through pattern recognition.

Obviously I could brute-force my way through with regular expressions, but that's not what I'm after. Let's say I have these 4 strings:

[MAS] Hayate no Gotoku!! 20 [BD 720p] [21D138F8].mkv
[Leopard-Raws] Akatsuki no Yona - 05 RAW (MX 1280x720 x264 AAC).mp4
[BLAST] Wolf Girl and Black Prince - 05 [720p] [C1252A5E].mkv
[sage]_Mobile_Suit_Gundam_AGE_-_36_[720p][10bit][45C9E0D0].mkv

As you can see, all these filenames follow a certain pattern but are not quite the same, so a silver-bullet regular expression wouldn't cut it. Instead I want to look at computational intelligence techniques such as ANNs, or another smart idea, to solve this problem.

Let's say we want to extract the titles. Humans would return these values:

Hayate no Gotoku!!
Akatsuki no Yona
Wolf Girl and Black Prince
Mobile Suit Gundam AGE

Or episode numbers: 20, 05, 05, 36. You get where I'm going with this.

What suggested techniques would be useful to achieve the desired result, or is this something that is being researched at universities and still has no solution?

Do you have a labelled training set? – Drew Noakes, Nov 04 '14 at 22:48
@DrewNoakes I could create a training set – Ortixx, Nov 04 '14 at 22:57

score 2 · Accepted Answer · edited May 23 '17 at 11:49

What you are looking for is called grammar induction. It works by making a program figure out a regular expression (or some other type of pattern) that matches certain strings but not others. You have to supply the strings yourself, however, as a training set of positive examples (strings that should be matched) and negative examples (strings that shouldn't be matched).

An interesting technique is called boosting, where you learn a lot of simple patterns that are each precise (they match no negative examples) but match only a few positive examples; combined, however, they match a large number of positive examples.
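To make the combination idea concrete, here is a minimal sketch. It is not a real boosting algorithm (there are no example weights or voting), only the "union of precise patterns" part of it. The candidate patterns are hand-written for illustration; a real learner would generate them itself. The function names are my own.

```python
import re

def is_precise(pattern, negatives):
    """True if the pattern fully matches none of the negative examples."""
    rx = re.compile(pattern)
    return not any(rx.fullmatch(s) for s in negatives)

def combine(candidates, positives, negatives):
    """Greedily keep precise patterns until every positive is covered."""
    chosen, uncovered = [], set(positives)
    for pattern in candidates:
        if not uncovered:
            break
        if not is_precise(pattern, negatives):
            continue  # matches a negative example, so discard it
        hits = {s for s in uncovered if re.fullmatch(pattern, s)}
        if hits:
            chosen.append(pattern)
            uncovered -= hits
    return chosen, uncovered

positives = ["[MAS] ", "[Leopard-Raws] ", "[BLAST] ", "[sage]_"]
negatives = ["[MAS] H", "[Leopard-Raws] Akat", "[BL",
             "[sage]_Mobile_Suit_Gundam_AGE_"]

# Hypothetical candidate patterns, each deliberately narrow.
candidates = [
    r"\[.*",          # too loose: matches negatives, so it is rejected
    r"\[[A-Z]+\] ",   # upper-case tag followed by a space
    r"\[[\w-]+\] ",   # tag with word characters or hyphens, then a space
    r"\[[a-z]+\]_",   # lower-case tag followed by an underscore
]

chosen, uncovered = combine(candidates, positives, negatives)
# OR the surviving patterns together into one combined pattern.
combined = re.compile("|".join(f"(?:{p})" for p in chosen))
```

Each pattern on its own covers only one or two of the positives, but the combined alternation covers all four without matching any negative.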

Since you want to extract substrings rather than just match strings, the way I would go about it is to take prefixes of the file names and try to match those. In this way you'd know where the substring starts. Here's an example:

Positives:
[MAS] 
[Leopard-Raws] 
[BLAST] 
[sage]_

Negatives:
[MAS] H
[Leopard-Raws] Akat
[BL
[sage]_Mobile_Suit_Gundam_AGE_
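Lists like the ones above could be generated from a labelled training set rather than typed by hand. The sketch below assumes each filename is labelled with the index where the content of interest starts; the function name, the `extra` parameter, and the rule of leaving separator-only differences unlabelled are my own assumptions.

```python
def prefix_examples(name, start, extra=3):
    """Build one positive prefix and several negative prefixes.

    name  -- the full file name
    start -- index where the content of interest begins (a human label)
    extra -- how many characters past `start` to use for negatives
    """
    positive = name[:start]
    negatives = []
    for k in range(1, min(len(name), start + extra) + 1):
        if k == start:
            continue
        # Assumption: a shorter prefix that differs only by trailing
        # separators (e.g. "[sage]" vs "[sage]_") is left unlabelled.
        if k < start and name[k:start].strip(" _") == "":
            continue
        negatives.append(name[:k])
    return positive, negatives

name = "[sage]_Mobile_Suit_Gundam_AGE_-_36_[720p][10bit][45C9E0D0].mkv"
positive, negatives = prefix_examples(name, start=7)
```

Which shorter prefixes count as negatives is a labelling decision; here they are all treated as negatives except the ones that only lack the trailing separator.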

If done correctly, you should obtain a regular expression which you can use on prefixes of the file names. By growing the prefix one letter at a time you can know where the content of interest starts. Like this:

[ False
[s False
[sa False
[sag False
[sage False
[sage] True
[sage]_ True
[sage]_M False

What happened here is that I increased the prefix of the file name one character at a time until the regular expression I learnt matched it. But I also wanted to find the longest prefix that matches (because otherwise I would have missed the underscore since [sage] is an acceptable prefix as well) so I continued moving forward until the regular expression stopped matching. In this way I would know that the prefix before the actual content starts is "[sage]_". You can do the same for matching where it ends as well by using prefixes which include the content of interest.
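The prefix-growing step can be sketched as follows. The regular expression here is hand-written and only stands in for whatever pattern the learner would actually produce; `PREFIX_RE` and `longest_matching_prefix` are names I made up for the example.

```python
import re

# Assumption: a hand-written stand-in for the learnt pattern.
# It accepts a bracketed tag, optionally followed by a space or underscore.
PREFIX_RE = re.compile(r"\[[^\]]+\][ _]?")

def longest_matching_prefix(name, pattern=PREFIX_RE):
    """Grow the prefix one character at a time and return the longest
    prefix that matches before the pattern stops matching."""
    best = None
    for end in range(1, len(name) + 1):
        if pattern.fullmatch(name[:end]):
            best = end          # still matching, remember this length
        elif best is not None:
            break               # matched earlier and now stopped: done
    return name[:best] if best is not None else None

names = [
    "[MAS] Hayate no Gotoku!! 20 [BD 720p] [21D138F8].mkv",
    "[sage]_Mobile_Suit_Gundam_AGE_-_36_[720p][10bit][45C9E0D0].mkv",
]
prefixes = [longest_matching_prefix(n) for n in names]
```

The content of interest then starts at `len(prefix)` in each file name. Stopping at the first failure after a match assumes the matching prefixes are contiguous, as in the trace above.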

To learn about regular expression learning see this post. Keep in mind that automated learning will never be perfect but the more examples you use the more accurate it will be.
