How to identify the occurrence of items in a list against another list

Question

I have a file with a column of text which I have loaded. I would like to check the occurrence of country names in the loaded text. I have loaded the Wikipedia countries CSV file and I am using the following code to count the number of occurrences of country names in the loaded text.

My code is not working.

Here is my code:

text = pd.read_sql(select_string, con)
text['tokenized_text'] = mail_text.apply(lambda col: nltk.word_tokenize(col['SomeText']), axis=1)
country_codes = pd.read_csv('wikipedia-iso-country-codes.csv')
ccs = set(country_codes['English short name lower case'])
count_occurrences = Counter(country for country in text['tokenized_text'] if country in ccs)

The current code you have has an indentation error - you should see to that, first. — Wayne Werner, Sep 20 '16 at 08:51
No, the indentation was just the result of my cut and paste here — JayDoe, Sep 20 '16 at 10:24
country_codes is the dataframe with the list of countries from wikipedia — JayDoe, Sep 20 '16 at 10:25

score 1 · Answer 1 · edited May 23 '17 at 12:19

In your original code the line

dic[country]= dic[country]+1

should raise a KeyError the first time a country is met, because the key is not yet present in the dictionary. Instead, check whether the key is present and, if it is not, initialize the value to 1.
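A minimal sketch of that key check with a plain dictionary. The names `tokens` and `known_countries` and their contents are made up for illustration and are not taken from the question:

```python
# Hypothetical sample data standing in for the question's tokens.
tokens = ["Finland", "is", "near", "Russia", "Finland"]
known_countries = {"Finland", "Japan"}

dic = {}
for country in tokens:
    if country in known_countries:
        if country in dic:
            # Key already present: increment the running count.
            dic[country] += 1
        else:
            # First occurrence: initialize instead of reading a missing key.
            dic[country] = 1

print(dic)  # {'Finland': 2}
```

A shorter equivalent of the inner if/else is `dic[country] = dic.get(country, 0) + 1`.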

On the other hand, it will not, because the check

if country in country_codes['English short name lower case']:

yields False for all values: a Series object's __contains__ checks the index labels, not the values. You should for example check

if country in country_codes['English short name lower case'].values:

if your list of values is short.
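The difference is easy to see on a small Series. This is only a sketch with made-up values; the real column would come from the CSV file:

```python
import pandas as pd

# Illustrative Series with the default integer index 0, 1, 2.
names = pd.Series(["Finland", "Japan", "Sweden"])

in_series = "Finland" in names          # tested against the index labels
label_in_series = 0 in names            # 0 is an index label
in_values = "Finland" in names.values   # tested against the underlying array
in_set = "Finland" in set(names)        # set built from the values

print(in_series, label_in_series, in_values, in_set)
# False True True True
```

For a long list of names, building the set once and testing against it avoids scanning the whole array on every check.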

For general counting tasks Python provides collections.Counter, which acts a bit like a defaultdict(int), but with added benefits. It removes the need for manual checking of keys etc.
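A short sketch of the behaviour that makes the manual key check unnecessary (the country names are just sample values):

```python
from collections import Counter

counts = Counter()
counts["Finland"] += 1               # no KeyError: a missing key counts as 0
counts.update(["Finland", "Japan"])  # count every item of an iterable

missing = counts["Vietnam"]          # 0, and the lookup does not add the key
top = counts.most_common(1)          # the most frequent item with its count

print(missing, top)  # 0 [('Finland', 2)]
```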

As you already have DataFrame objects, you could use the tools pandas provides:

In [12]: country_codes = pd.read_csv('wikipedia-iso-country-codes.csv')

In [13]: text = pd.DataFrame({'SomeText': """Finland , Finland , Finland
    ...: The country where I want to be
    ...: Pony trekking or camping or just watch T.V.
    ...: Finland , Finland , Finland
    ...: It's the country for me
    ...: 
    ...: You're so near to Russia
    ...: so far away from Japan
    ...: Quite a long way from Cairo
    ...: lots of miles from Vietnam
    ...: 
    ...: Finland , Finland , Finland
    ...: The country where I want to be
    ...: Eating breakfast or dinner
    ...: or snack lunch in the hall
    ...: Finland , Finland , Finland
    ...: Finland has it all
    ...: 
    ...: Read more: Monty Python - Finland Lyrics | MetroLyrics
    ...: """.split()})

In [14]: text[text['SomeText'].isin(
    ...:     country_codes['English short name lower case']
    ...: )]['SomeText'].value_counts().to_dict()
    ...:
Out[14]: {'Finland': 14, 'Japan': 1}

This finds the rows of text where the SomeText column's value is in the 'English short name lower case' column of country_codes, counts the occurrences of each unique value of SomeText, and converts the result to a dictionary. The same with descriptive intermediate variables:

In [49]: where_sometext_isin_country_codes = text['SomeText'].isin(
    ...:     country_codes['English short name lower case'])

In [50]: filtered_text = text[where_sometext_isin_country_codes]

In [51]: value_counts = filtered_text['SomeText'].value_counts()

In [52]: value_counts.to_dict()
Out[52]: {'Finland': 14, 'Japan': 1}

The same with Counter:

In [23]: from collections import Counter

In [24]: dic = Counter()
    ...: ccs = set(country_codes['English short name lower case'])
    ...: for country in text['SomeText']:
    ...:     if country in ccs:
    ...:         dic[country] += 1
    ...: 

In [25]: dic
Out[25]: Counter({'Finland': 14, 'Japan': 1})

or simply:

In [30]: ccs = set(country_codes['English short name lower case'])

In [31]: Counter(country for country in text['SomeText'] if country in ccs)
Out[31]: Counter({'Finland': 14, 'Japan': 1})

answered Sep 20 '16 at 08:45

Ilja Everilä


So what happened to Russia and Vietnam? Are they no longer countries? I think that the source data could be better... – Frangipanes Sep 20 '16 at 09:06
Russia is there, but it's not just "Russia", but the "Russian Federation". Vietnam on the other hand was not. OPs data and method could use some improvement. – Ilja Everilä Sep 20 '16 at 09:08
Good point about Russia because it is never referred to as "Russian Federation" but just as "Russia" so maybe I need to find another source file for country codes? – JayDoe Sep 20 '16 at 09:15
Combining multiple data sources could help. Fuzzier matching as well, but that's getting beside the point a bit. In the end you probably have to accept some margin of error and go with it. A note about these kinds of tasks (check if list's elements exist in another): get used to using `set`s in the general case. A contains check is *O(1)* on average for sets, *O(n)* for sequences. – Ilja Everilä Sep 20 '16 at 09:17
Do you mean convert the dataframe to a set and use intersection to identify the common elements? Would it work in this case? – JayDoe Sep 20 '16 at 09:27
There's no need in this case. Using intersection would not work for you anyway, as that'd just return the elements the 2 sets have in common, without repetitions. What I meant was that if you had to work with plain lists etc., then it'd pay off to check for `item in a_set` instead of `item in a_list`. – Ilja Everilä Sep 20 '16 at 09:31
As to why it is not necessary in this case, a recent enough pandas [does the converting to set for you](https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L161). – Ilja Everilä Sep 20 '16 at 09:41
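The distinction drawn in these comments between a set intersection and a count filtered by set membership can be sketched as follows. The word list and the country set are made up for illustration:

```python
from collections import Counter

words = ["Finland", "Finland", "Japan", "Cairo", "Finland"]  # made-up tokens
ccs = {"Finland", "Japan", "Sweden"}                         # made-up country set

# Intersection: which countries occur at all; repetitions are lost.
common = set(words) & ccs

# Membership filter: every matching token is kept, so it can be counted.
counts = Counter(word for word in words if word in ccs)

print(sorted(common))  # ['Finland', 'Japan']
print(counts)          # Counter({'Finland': 3, 'Japan': 1})
```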
Sorry, I am still confused by the solution provided above :( I only have one column 'SomeText' - why is it being split()? and why are the identified elements not being added to a dictionary so that I can count them? – JayDoe Sep 20 '16 at 10:03
You did not provide a complete example with data, so I had to roll my own. It is just example data, don't mind it. The counting is done using pandas' tools (because you already have `DataFrame`s, and pandas is usually plenty fast), converting the final result to a dictionary. – Ilja Everilä Sep 20 '16 at 10:14
So where is the resulting dictionary with elements and their number of occurrences? Is it in text['SomeText']? – JayDoe Sep 20 '16 at 10:32
At this point I must advise you to read or reread the [python tutorial](https://docs.python.org/3/tutorial/). It seems you have some difficulties with the very basic concepts of python. – Ilja Everilä Sep 20 '16 at 10:37
Thanks for the breakdown, it makes more sense to me now :) – JayDoe Sep 20 '16 at 10:41
Ok, at the risk of sounding stupid, it is still not working. Using the 'set' solution provided: I have my set of country codes - CCS - and just to make sure I printed them out and they are all there, so the problem statement is: `Counter(country for country in text['SomeText'] if country in ccs)`. It is not returning anything at all. I feel like I am missing something very obvious but I just cannot see what it is! – JayDoe Sep 20 '16 at 23:23
Do you assign the expression's result to a variable? – Ilja Everilä Sep 21 '16 at 05:07
Yes, and it just gives an empty Counter: Counter() – JayDoe Sep 21 '16 at 07:52
Hard to say this or that without the *actual data*. Perhaps you've tokenized your text in a way that has concatenated punctuation with words. Think `"Finland"` vs. `"Finland."`. The latter is not a match. You are really pushing the boundaries of a [minimal, **complete** and **verifiable** example](http://stackoverflow.com/help/mcve). In other words: we cannot help you with incomplete information. What does your `DataFrame` *text* actually contain? And if you wish to add that information, *edit it in to the question itself, do not comment*. – Ilja Everilä Sep 21 '16 at 07:54
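The punctuation problem suggested here is only a guess about the asker's data, but it is easy to reproduce. A sketch with a made-up sentence, stripping punctuation with the standard library rather than a tokenizer:

```python
import string
from collections import Counter

ccs = {"Finland", "Japan"}                            # made-up country set
raw = "I like Finland. Japan, too; Finland!".split()  # whitespace split only

# Strip leading and trailing punctuation from each token.
cleaned = [token.strip(string.punctuation) for token in raw]

raw_counts = Counter(t for t in raw if t in ccs)          # "Finland." != "Finland"
clean_counts = Counter(t for t in cleaned if t in ccs)

print(raw_counts)    # Counter()
print(clean_counts)  # Counter({'Finland': 2, 'Japan': 1})
```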
Thanks for the tip!, I have amended my code and tokenized the extracted text (see amendment in question). The only problem is that this produces a list which gives a 'type error: Unhashable type'. Do I need to amend the loop because it is a list? – JayDoe Sep 21 '16 at 12:39
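A likely cause of that error, assuming each row of the tokenized column holds a list of tokens: iterating over the column yields whole lists, and a list cannot be looked up in a set. A minimal sketch that flattens the rows first, with `tokenized_rows` as a plain-Python stand-in for `text['tokenized_text']`:

```python
from collections import Counter
from itertools import chain

ccs = {"Finland", "Japan"}  # made-up country set

# Stand-in for text['tokenized_text']: one list of tokens per row.
tokenized_rows = [
    ["Finland", "is", "far", "from", "Japan"],
    ["Finland", "has", "it", "all"],
]

# `a_list in ccs` raises TypeError: unhashable type: 'list',
# so chain the rows into one stream of single tokens before filtering.
count_occurrences = Counter(
    token for token in chain.from_iterable(tokenized_rows) if token in ccs
)

print(count_occurrences)  # Counter({'Finland': 2, 'Japan': 1})
```

Iterating over a pandas column yields its values one by one, so passing the column itself to `chain.from_iterable` should behave the same way as the list of lists here.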
