3

I'm trying to write a Python script that will search through a CSV file and identify the number of occurrences when two items appear next to each other.

For example, let's say the CSV looks like the following:

red,green,blue,red,yellow,green,yellow,red,green,purple,blue,yellow,red,blue,blue,green,purple,red,blue,blue,red,green 

And I'd like to find the number of times "red" is immediately followed by "green" (but I'd like a solution that isn't specific to the words in this CSV).

So far, I thought that possibly converting the CSV to a list might be a good start:

import csv
with open('examplefile.csv', 'rb') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print your_list

Which returns:

[['red', 'green', 'blue', 'red', 'yellow', 'green', 'yellow', 'red', 'green', 'purple', 'blue', 'yellow', 'red', 'blue', 'blue', 'green', 'purple', 'red', 'blue', 'blue', 'red', 'green ']]

In this list, there are three occurrences of 'red', 'green'. What approach, module, or loop structure could I use to find out whether a pair of items appears right next to each other more than once in a list?

AdjunctProfessorFalcon

2 Answers

4

What you are looking for are called bigrams (pairs of two words). You usually see these in text-mining/NLP-type problems. Try this:

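from itertools import islice, izip
from collections import Counter

# csv.reader returns a list of rows: take the first row and strip stray whitespace
row = [word.strip() for word in your_list[0]]
print Counter(izip(row, islice(row, 1, None)))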

which returns:

Counter({('red', 'green'): 3, ('red', 'blue'): 2, ('yellow', 'red'): 2, ('green', 'purple'): 2, ('blue', 'blue'): 2, ('blue', 'red'): 2, ('purple', 'blue'): 1, ('red', 'yellow'): 1, ('green', 'blue'): 1, ('purple', 'red'): 1, ('blue', 'yellow'): 1, ('blue', 'green'): 1, ('yellow', 'green'): 1, ('green', 'yellow'): 1})

If you only need the items with more than one occurrence, treat the Counter object just like a Python dict.

row = [word.strip() for word in your_list[0]]  # first CSV row, whitespace stripped
counts = Counter(izip(row, islice(row, 1, None)))
print [k for k, v in counts.iteritems() if v > 1]

So you just have the relevant pairs:

[('red', 'blue'), ('red', 'green'), ('yellow', 'red'), ('green', 'purple'), ('blue', 'blue'), ('blue', 'red')]

See this post, from which I borrowed some code: Counting bigrams (pair of two words) in a file using python
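
The snippets in this answer are Python 2 (`izip`, `iteritems`, the `print` statement). As a rough sketch of the same idea for Python 3, assuming the CSV has a single row and that stray whitespace around values should be ignored, something like the following could work. The `count_bigrams` helper and the in-memory `io.StringIO` stand-in for `examplefile.csv` are illustrative assumptions, not part of the original answer.

```python
import csv
import io
from collections import Counter


def count_bigrams(items):
    """Count each pair of adjacent items; zip stops at the shorter input."""
    return Counter(zip(items, items[1:]))


# Assumed stand-in for open('examplefile.csv', newline=''); note the trailing space.
data = io.StringIO(
    "red,green,blue,red,yellow,green,yellow,red,green,purple,blue,"
    "yellow,red,blue,blue,green,purple,red,blue,blue,red,green \n"
)

# csv.reader yields one list per row, so take the first row and strip each value.
row = [value.strip() for value in next(csv.reader(data))]

counts = count_bigrams(row)
# Keep only the pairs seen more than once, sorted for a stable order.
repeated = sorted(pair for pair, n in counts.items() if n > 1)

print(counts[("red", "green")])  # 3
print(repeated)
```

Slicing with `items[1:]` copies the list, which is fine for small inputs; `itertools.islice(items, 1, None)` avoids the copy if the row is large.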

Sahan Serasinghe
DG1
  • @DG1 This is great, thank you! Do you mind breaking down how this line is working? `Counter(izip(your_list, islice(your_list, 1, None)))` – AdjunctProfessorFalcon Jun 05 '15 at 05:13
  • Can this approach be adapted if you're trying to find occurrences of sets of words together? For example, 'star wars', 'space balls'? – AdjunctProfessorFalcon Jun 05 '15 at 14:26
  • 2
    @gillbates, islice iterates through the list starting from element 1 until the end. izip zips the list starting from element 0 with that slice, thereby grouping adjacent words together. Counter then iterates through the zipped pairs and counts occurrences. Look up slicing and zipping in Python if this is unclear, and then look at how itertools turns these operations on lists into iterators. – DG1 Jun 05 '15 at 15:58
  • @DG1 Let's say you have a CSV file that looks like "red star cafe, blue bull coffee shop, cozy cafe, coffee shop shop,...." etc - and I'd like to know the number of times that "red star cafe" and "blue bull coffee shop" appeared next to each other in the CSV file. Can I tweak the bigrams method for that, or will that not work because I'm looking for groups of words as a pair? – AdjunctProfessorFalcon Jun 05 '15 at 18:44
  • 1
    Understood. If `your_list = ['red star cafe', 'blue bull coffee shop', 'starbucks', 'blue bull coffee shop', "bob's coffee", 'red star cafe', 'blue bull coffee shop']` then this method will work exactly as is... it is not parsing words within the elements, just matching elements of the list. – DG1 Jun 05 '15 at 19:22
  • @DG1 Thanks for the breakdown, appreciate it! – AdjunctProfessorFalcon Jun 06 '15 at 04:43
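
To make the breakdown in the comments concrete, here is a small Python 3 sketch that builds the pairs one step at a time (in Python 3 the built-in `zip` is already lazy, so it stands in for `izip`). The sample list `shops` is made up for illustration.

```python
from collections import Counter
from itertools import islice

# Hypothetical sample data with multi-word elements.
shops = ["red star cafe", "blue bull coffee shop", "starbucks",
         "red star cafe", "blue bull coffee shop"]

# islice(shops, 1, None) lazily yields every element except the first.
shifted = list(islice(shops, 1, None))

# zip pairs element i with element i + 1 and stops when the shorter input ends.
pairs = list(zip(shops, shifted))

# Counter tallies identical tuples; elements are compared whole, not word by word.
counts = Counter(pairs)

print(pairs[0])
print(counts[("red star cafe", "blue bull coffee shop")])  # 2
```

Because each list element is treated as one opaque value, multi-word entries need no special handling as long as the list is split on commas rather than on spaces.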
1

This will check for both 'red','green' and 'green','red' combinations in one go:

pair = ('red', 'green')
positions = [i for i in xrange(len(l)-1) if ((l[i],l[i+1]) == pair or (l[i+1],l[i]) == pair)]
print positions
>>> [0, 7] # notice that your last entry was 'green ' not 'green'

The output lists each index i at which the pattern starts.

Testing with your example (corrected at the end for 'green '):

l = [['red', 'green', 'blue', 'red', 'yellow', 'green', 'yellow', 'red', 'green', 'purple', 'blue', 'yellow', 'red', 'blue', 'blue', 'green', 'purple', 'red', 'blue', 'blue', 'red', 'green']]
l = l[0]

# add another entry to test reversed matching
l.append('red')

pair = ('red', 'green')
positions = [i for i in xrange(len(l)-1) if ((l[i],l[i+1]) == pair or (l[i+1],l[i]) == pair)]

print positions
>>> [0, 7, 20, 21]

if len(positions) > 1:
    print 'do stuff'
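
For Python 3, where `xrange` and the `print` statement are gone, the same check could be wrapped in a small function. This is only a sketch: `adjacent_positions` is a hypothetical name, the `colors` list is made-up sample data, and testing each window against a set holding the pair and its reverse is just one way to accept either order.

```python
def adjacent_positions(items, pair):
    """Return each index i where (items[i], items[i + 1]) equals pair in either order."""
    first, second = pair
    targets = {(first, second), (second, first)}
    # zip(items, items[1:]) yields every adjacent window; enumerate supplies its start index.
    return [i for i, window in enumerate(zip(items, items[1:])) if window in targets]


# Hypothetical sample data.
colors = ["red", "green", "blue", "green", "red", "red"]
positions = adjacent_positions(colors, ("red", "green"))

print(positions)  # [0, 3]

if len(positions) > 1:
    print("the pair occurs more than once")
```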
Alexander McFarlane