1

Suppose I have a long string:

AX 90 10 20 AX 30 14 50 ER 40 50 68 ...

What I need is

['AX 90 10 20', 'AX 30 14 50', 'ER 40 50 68',...]

I don't want to use a regex as I'll get different pattern of repetition such as the below where regex for above won't work for the below

WE 12 (09/09) ER 14 (12/56) TO 90 (45/67) ...

I started with creating the structural representation(converting [A-Z] to 'A' and [0-9] to '9') for the first example

AX 90 10 20 AX 30 14 50 ER 40 50 68 ...
to
AA 99 99 99 AA 99 99 99 AA 99 99 99 ...

My question is, how do I go by recognizing pattern in each string on the fly and then get the matches?

NOTE: The pattern is unknown, but it is known that some set of charcters repeat after sometime

I don't want to use regex written manually. If system generates the regex, then it would be fine.

Vishnudev Krishnadas
  • 10,679
  • 2
  • 23
  • 55
  • How do you define a pattern? Two alphabet characters followed by a series of numbers separated by whitespace? – zwer Oct 25 '18 at 06:38
  • @zwer For clarity, i have edited the question – Vishnudev Krishnadas Oct 25 '18 at 06:40
  • Ok, what's a "some set of characters"? You need to define what constitutes a pattern and what doesn't, otherwise any part of that data can be defined as a pattern (i.e. whitespace is obviously a pattern with it occupying every third character). – zwer Oct 25 '18 at 06:46
  • @zwer I have specified that the pattern is unknown. That means I don't know the characteristics of a pattern, but only know that it is repetitive. – Vishnudev Krishnadas Oct 25 '18 at 07:26
  • 2
    Perhaps you could supply some other examples of inputs and expected outputs to help illustrate what you mean by "the pattern is unknown but repetitive". – Doddie Oct 25 '18 at 07:35
  • 4
    Well, if you don't know how to define a valid pattern, how do you expect of a computer to find one? The most obvious pattern in your example data is the whitespace as every third character and yet you claim that should not be recognized as a pattern - if you are able to discriminate between valid and invalid patterns, surely you can define what a valid pattern is and what it isn't? – zwer Oct 25 '18 at 07:38
  • @zwer You have a valid question. This example would explain what I am trying to say: string="AZ893249EE886342TT125435". I convert it into "AA999999AA999999AA999999" so that i could see the pattern and then split as ["AA999999", "AA999999", "AA999999"] having smallest repeating set of chars. – Vishnudev Krishnadas Oct 25 '18 at 07:49
  • 2
    But what standard are you using when 'converting' it? You need to have clear definitions of what constitutes a pattern - I, with all of my biological pattern recognition and intuition cannot see the pattern in your data different than digits being separated by alpha characters in a repeating fashion (two alpha chars, six digits etc.), how do you expect a computer to do better? Computers expect clearly defined rules (even for fuzzy logic), they cannot just intuit what the user expects from them. – zwer Oct 25 '18 at 08:22
  • Thank you for your interest @zwer . To make it simple, assume that the pattern starts with an alphabet. What would be your approach? If the pattern is unknown you say that this problem is unsolvable? – Vishnudev Krishnadas Oct 25 '18 at 09:41

2 Answers2

3

You can try of using time series analysis seasonality to get similar patterns in sequence

for that you can try of converting string to integers and apply seasonal_decompose with use of statsmodels ,then you can observe the period of repeated pattern from graph.

from matplotlib import pyplot
from statsmodels.tsa.seasonal import seasonal_decompose
a = 'AX 90 10 20 AX 30 14 50 ER 40 50 68'
a = list(map(ord,a))
series = pd.Series(a ,index = pd.date_range('1-1-2011',pd.to_datetime('1-1-2011')+np.timedelta64(len(a)-1,'D'),freq='D'))
result = seasonal_decompose(series, model='additive')

result.observed.plot()
result.trend.plot()
pyplot.rcParams["figure.figsize"] = (20,3)
pyplot.show()

Observed seasonality and the trend of sequence enter image description here

then with the observed period you can split the sequence

Edit

To find the periodicity of the sequence without visual inspection,

We can find the periodicity of signal with the use of autocorrelation and with the correlation lag of signal which show the periodicity. with that we can slice the pattern to retrieve similar patters

def autocorr(x):
    n = x.size
    norm = (x - np.mean(x))
    result = np.correlate(norm, norm, mode='same')
    acorr = result[n//2 + 1:] / (x.var() * np.arange(n-1, n//2, -1))
    lag = np.abs(acorr).argmax() + 1
    return lag
period = autocorr(np.array(a))



#Here the pattern is repeating for each period of 12 values, you can pick the period also 
period = 12
for i in range(0,len(a),period):
    print(''.join(map(chr,a[i:i+period])))

Out:

AX 90 10 20 
AX 30 14 50 
ER 40 50 68
Naga kiran
  • 4,528
  • 1
  • 17
  • 31
  • The answer looks promising, but it went over my head. Could you explain a bit more on the statistical analysis you have done and on the usage of time-series or provide any link to read? – Vishnudev Krishnadas Oct 25 '18 at 07:59
  • 1
    Just one more question, Is there any way to determine the period using code rather than an visual inspection? – Vishnudev Krishnadas Oct 25 '18 at 08:02
  • 1
    This yields a visual result only due to the fact that `ascii` value of digits is lower than that of alphabet characters (btw. you don't need to use time series at all here) - try analyzing a structure like `AA991199BB990099AA880088` using the same technique. It doesn't solve the problem of automatic pattern recognition, but then again how could it when we're still debating what is a 'pattern' in this case? – zwer Oct 25 '18 at 08:17
  • 1
    @Vishnudev - We're still discussing the question, no? Until you define what constitutes a pattern and how to discern between a pattern and a no-pattern any given answer will be wrong. – zwer Oct 25 '18 at 09:16
  • Hi @Vishnudev I just added the snippet to find the periodicity of signal without visual inspection , you can have look – Naga kiran Oct 25 '18 at 09:52
1

If you know that your target strings all start with an N-length (here 2) uppercase pattern, I'm not really sure this is that complicated.

The following is a possible solution:

import re # only used in is_token, but not technically needed

def is_token(t):
    return re.match(r'^[A-Z]+$', t)

def get_token_candidate_at(s, idx):
    return s[idx:idx+2]

def emit_items(s):
    delim_start = -1
    for i,_ in enumerate(s):
        token = get_token_candidate_at(s, i)

        if is_token(token):
            if delim_start >= 0:
                yield s[delim_start:i]
            delim_start = i

    if delim_start > 0: # get the last one
        yield s[delim_start:]

> list(emit_items("WE 12 (09/09) ER 14 (12/56) TO 90 (45/67)"))
  ['WE 12 (09/09) ', 'ER 14 (12/56) ', 'TO 90 (45/67)']

> list(emit_items("WE12(09/09)ER14(12/56)TO90(45/67)"))
  ['WE12(09/09)', 'ER14(12/56)', 'TO90(45/67)']

> list(emit_items("AZ893249EE886342TT125435"))
  ['AZ893249', 'EE886342', 'TT125435']

If they have a different start, you can change is_token and get_token_candidate_at to meet those different requirements.

If the pattern really is periodic, then you might be able to get away with something akin to frequency analysis, but then you have to know something about "what" is periodic (like 'non-numbers are period') and then hope the string is long enough to provide a meaningful periodic signal. This is what @zwer is getting at..."what are the properties of the pattern that you are expecting".

Doddie
  • 1,393
  • 13
  • 21