
I am new to coding and could use help. Here is my task: I have a CSV of online marketing image titles. It is a single column, and each cell holds the title text for one ad, which is just a string of words. For instance, cell A1 reads: "16 Maddening Tire Fails", and so on. To load the CSV I do:

import csv

# 'rb' is the Python 2 idiom; on Python 3 use open('usethis.csv', newline='') instead.
with open('usethis.csv', 'rb') as f:
    mycsv = csv.reader(f)
    mycsv = list(mycsv)

I initialize a list:

mylist = []

I want to take the text in each cell and extract the bigrams. I do that as follows:

import nltk
from nltk import word_tokenize

for c in mycsv:
    mylist.append(list(nltk.bigrams(word_tokenize(' '.join(c)))))

mylist then looks like this, but with more data:

[[('16', 'Maddening'), ('Maddening', 'Tire'), ('Tire', 'Fails')], [('16', 'Maddening'), ('Maddening', 'Tire'), ('Tire', 'Fails'), ('Fails', 'That'), ('That', 'Show'), ('Show', 'What'), ('What', 'True'), ('True', 'Negligence'), ('Negligence', 'Looks'), ('Looks', 'Like')]]

mylist holds individual lists which are the bigrams created from each cell in my csv.

Now I want to loop through every bigram in all lists and, next to each bigram, print the number of times it appears in another list (cell). This is basically the same as a COUNTIFS in Excel. For instance, if the bigram ('16', 'Maddening') in the first list (cell A1) appears 3 other times in mylist, then print the number 3 next to it, and so on for each bigram. If it is easier to return this information in a new list, that's fine; I just want it printed out somewhere that makes sense.

I have done a lot of reading online. For instance, this link was along the same general idea: "How to check if all elements of a list matches a condition?"

This link about dictionaries was also similar, in that it returns a number next to each value, just as I want to return a count next to each bigram: "What are Python dictionary view objects?"

But I really am at a loss as to how to do this. Thank you so much in advance for your help! Let me know if I need to explain something better.

Josh Flori
  • `' '.join(c)` is pointless-- looks like it's just an ugly way of getting the string `c[0]`, the only column as you say. Just write `c[0]`. – alexis Sep 27 '17 at 20:50

1 Answer


You can use collections.Counter for this task. Since you are already using NLTK, FreqDist and its derived classes might come in handy when you want to do more than just counting, but for now let's stick with the simpler Counter.

Counter is a subclass of dict, i.e. it can do everything a dictionary can, but it has additional functionality.

The following snippet extends the code you showed:

from collections import Counter

bigram_counts = Counter()
for cell in mylist:
    for bigram in cell:
        bigram_counts[bigram] += 1

After this, you can look up individual bigrams with a subscript, e.g. bigram_counts['16', 'Maddening'] will return 3 or whatever the actual count was. With bigram_counts.most_common(5) you get the 5 most frequent bigrams.
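As a rough illustration of these lookups, here is a minimal sketch that runs without the CSV or NLTK. The two sample titles, the `titles` list, and the `to_bigrams` helper are assumptions made up for this example; `to_bigrams` uses a plain whitespace split as a stand-in for `word_tokenize` and `nltk.bigrams`, so real tokenization may differ.

```python
from collections import Counter

# Hypothetical sample data standing in for the CSV column.
titles = [
    "16 Maddening Tire Fails",
    "16 Maddening Tire Fails That Show What True Negligence Looks Like",
]

def to_bigrams(title):
    # Stand-in for nltk.bigrams(word_tokenize(title)): whitespace split only.
    words = title.split()
    return list(zip(words, words[1:]))

mylist = [to_bigrams(t) for t in titles]

bigram_counts = Counter()
for cell in mylist:
    for bigram in cell:
        bigram_counts[bigram] += 1

print(bigram_counts['16', 'Maddening'])  # subscript lookup with a tuple key
print(bigram_counts['Tire', 'Swing'])    # unseen bigrams count as 0, no KeyError
print(bigram_counts.most_common(3))      # the three most frequent bigrams
```

Note that a missing key gives 0 instead of raising a KeyError, which is one of the conveniences Counter adds on top of dict.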

Update

... to actually answer the specific problem in your question.

In order to know the number of occurrences in all but one cell, you need to have separate counters for each cell. Replace the previous snippet with the following:

# Populate n+1 counters.
bigram_totals = Counter()
separate_counters = []
for cell in mylist:
    bigram_current = Counter()
    separate_counters.append(bigram_current)
    for bigram in cell:
        bigram_totals[bigram] += 1
        bigram_current[bigram] += 1

# Look up all bigram counts.
for cell, bigram_current in zip(mylist, separate_counters):
    for bigram in cell:
        count = bigram_totals[bigram] - bigram_current[bigram]
        # print(bigram, count) or whatever...

So, in addition to the total counts, we have a separate counter for each cell. When doing a lookup, we subtract the local count from the global count to get the sum of occurrences everywhere else.
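To make the subtraction concrete, here is a small worked sketch on toy data. The function name `counts_elsewhere` and the three toy cells are made up for illustration; the logic is the same as the two loops above, just wrapped in a function that returns the results instead of printing them.

```python
from collections import Counter

def counts_elsewhere(cells):
    """For each cell, pair every bigram with its count in all *other* cells."""
    totals = Counter()
    per_cell = []
    for cell in cells:
        current = Counter()
        per_cell.append(current)
        for bigram in cell:
            totals[bigram] += 1
            current[bigram] += 1
    result = []
    for cell, current in zip(cells, per_cell):
        # Global count minus this cell's own count = occurrences elsewhere.
        result.append([(bigram, totals[bigram] - current[bigram]) for bigram in cell])
    return result

# Hypothetical toy input: three cells of bigrams.
cells = [
    [('a', 'b'), ('b', 'c')],
    [('a', 'b'), ('a', 'b'), ('c', 'd')],
    [('a', 'b')],
]

for row in counts_elsewhere(cells):
    print(row)
```

Here ('a', 'b') occurs 4 times overall, so it gets 3 in the first cell (4 minus its 1 local occurrence) and 2 in the second cell (4 minus 2). A result of 0 means the bigram appears in no other cell, even if it is repeated within its own cell.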

Btw, since you mentioned learning purposes, the first block can be written a bit shorter by taking advantage of special Counter features:

# Populate n+1 counters.
bigram_totals = Counter()
separate_counters = []
for cell in mylist:
    bigram_current = Counter(cell)
    separate_counters.append(bigram_current)
    bigram_totals.update(bigram_current)

I think this is a bit more elegant, but it might be harder to understand for a beginner. Decide for yourself which version you think is more readable.
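If you want to convince yourself that the two versions are equivalent, a quick check along these lines may help. The toy `mylist` and the names `totals_loop` and `totals_short` are assumptions for this sketch only.

```python
from collections import Counter

# Hypothetical toy input.
mylist = [
    [('a', 'b'), ('b', 'c')],
    [('a', 'b'), ('a', 'b')],
]

# Explicit version: increment one bigram at a time.
totals_loop = Counter()
for cell in mylist:
    for bigram in cell:
        totals_loop[bigram] += 1

# Short version: count a whole cell at once, then add it to the totals.
totals_short = Counter()
for cell in mylist:
    totals_short.update(Counter(cell))

print(totals_loop == totals_short)
print(Counter(mylist[1]))
```

The two special features used here are that Counter(iterable) counts the items of any iterable in one step, and that Counter.update adds counts to the existing ones, whereas dict.update would overwrite them.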

lenz
  • Sorry, just saw I didn't fully answer your question – these are just global counts. What you ask for is a bit more involved and requires *n* counters if you have *n* lists. Do you really need to know how often a bigram appears in all other cells? Maybe you can give a bit more context (why you need exactly this). – lenz Sep 27 '17 at 18:45
  • Global counts are fine because I can just simply deduct the number of occurrences in each cell from that count. But an example of n counters where the occurrence in the ith cell is not counted would be helpful for my learning purposes. – Josh Flori Sep 27 '17 at 18:51
  • My desire is simply to explore why some marketing titles have a better click-through rate than others. What text features trigger a response in the reader? One basic theory is that, naturally, the sequence of words matters (because after all that's all text is), so I'm wondering if higher-performing titles have more unique bigrams and trigrams compared to the rest of the titles, where uniqueness points toward some positive pairing of words. IDK, just one thought worth exploring... but your code is definitely what I was looking for, basically, thanks! Two for statements were what I was missing. – Josh Flori Sep 27 '17 at 18:53
  • When I try printing results as follows: `def mycounts(list): for cell in mylist: for bigram in cell: return {bigram, bigram_counts(bigram)` it says counter is not callable. What is the appropriate approach here? – Josh Flori Sep 27 '17 at 18:55
  • Take care to distinguish round parens `()`, square brackets `[]`, and curly braces `{}` – they have distinct meanings. For accessing an element/value of a list/dict with the so-called "subscript notation", use brackets. – lenz Sep 27 '17 at 20:17
  • If you're looking for unique bigrams, you can probably get pretty close by looking at bigrams with a count of 1 (but you will miss the ones that appear multiple times in the same cell, but nowhere else). Anyway, I added an example of how to do it eagerly. If you think this post answered your question, you can "accept" it with the check mark below the vote count. – lenz Sep 27 '17 at 20:49
  • Awesome, i will explore this later, a cursory glance looks like it should work. Thanks! – Josh Flori Sep 28 '17 at 16:00
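
For reference, the `mycounts` attempt in the comments above runs into two separate problems: `bigram_counts(bigram)` calls the counter instead of subscripting it, and a `return` inside the loop exits after the very first bigram. One possible way to write it is sketched below. The signature taking both the cells and the counter as arguments is a choice made for this example (the original used a parameter named `list`, which shadows the built-in), and the toy `mylist` stands in for the real data.

```python
from collections import Counter

def mycounts(cells, counts):
    """Return one list per cell of (bigram, count) pairs."""
    result = []
    for cell in cells:
        # Square brackets subscript the counter; parentheses would try to call it.
        result.append([(bigram, counts[bigram]) for bigram in cell])
    return result

# Hypothetical toy data in place of the real mylist.
mylist = [[('16', 'Maddening'), ('Maddening', 'Tire')], [('16', 'Maddening')]]
bigram_counts = Counter(bigram for cell in mylist for bigram in cell)

for row in mycounts(mylist, bigram_counts):
    print(row)
```

Collecting everything into `result` and returning once at the end is what lets the function cover every cell.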