I have a column with text data. A sample is shown below.

                 column1
                  Apple
                  Mango
                  Grape
                  banana
                  Apple
                  Mango
                  Fruit

In this data, Apple is always followed by Mango; in other words, whenever Apple occurs, Mango occurs next. There may be more than one such pairing. How can these be found? I know the text-similarity techniques used in NLP, but I am not sure how to approach this kind of situation. Any suggestions, please?

Learner
  • It seems like you're looking for bigrams, like in [this](https://stackoverflow.com/q/21844546) or [this other](https://stackoverflow.com/q/37651057) question – arturomp Aug 01 '17 at 14:54
  • If you're looking for 100% predictability, then this is *not* an ML problem, it's straightforward programming. If you're looking for "usually follows", then you need to look at those NLP techniques you mentioned. Either way, please refine this question to a fitting Stack Overflow posting. – Prune Aug 01 '17 at 17:08
  • Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation. [on topic](http://stackoverflow.com/help/on-topic) and [how to ask](http://stackoverflow.com/help/how-to-ask) apply here. – Prune Aug 01 '17 at 17:09

1 Answer

Without using ML:

col = ['Apple', 'Mango', 'Grape', 'banana', 'Apple', 'Mango', 'Fruit']
for wrd in set(col):
    # Positions at which this word occurs
    indices = [i for i, x in enumerate(col) if x == wrd]
    if len(col) - 1 in indices:
        continue  # Last element cannot be followed by anything
    elif len(indices) == 1:
        continue  # Do we want single elements? I suppose not
    elif len({col[i + 1] for i in indices}) == 1:
        # Every occurrence has the same successor
        print(wrd + " is always followed by " + col[indices[0] + 1])

> Apple is always followed by Mango
aless80
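
The bigram idea raised in the comments covers the "usually follows" case as well as the "always follows" one. The following is only a minimal sketch of that idea, not part of the answer above: it counts each adjacent pair once and reports, for every word, its most frequent follower and how often it occurs. The helper name `follower_stats` is hypothetical, and the sketch assumes the column has already been read into a plain Python list and that matching is case-sensitive.

```python
from collections import Counter, defaultdict


def follower_stats(items):
    """Map each item to a Counter of the items that come directly after it.

    Hypothetical helper for illustration; comparison is case-sensitive.
    """
    followers = defaultdict(Counter)
    # zip(items, items[1:]) yields every adjacent pair (bigram) in order
    for current, nxt in zip(items, items[1:]):
        followers[current][nxt] += 1
    return followers


col = ['Apple', 'Mango', 'Grape', 'banana', 'Apple', 'Mango', 'Fruit']
stats = follower_stats(col)

for word, counts in stats.items():
    total = sum(counts.values())
    follower, n = counts.most_common(1)[0]
    # A share of 100% with total > 1 corresponds to "always followed by"
    print(f"{word} -> {follower}: {n}/{total} ({n / total:.0%})")
```

A word whose top follower accounts for all of its occurrences (and which occurs more than once) is an "always followed by" pair; lowering that threshold gives "usually followed by". Unlike the loop above, this makes a single pass over the list, and a word that only appears in the last position gets no entry at all.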