
I would like to count the occurrences of a list of words for every article contained in a single text file. Each article can be identified because they all start with the common tag "<p>Advertisement".

This is a sample of the text file:

"[<p>Advertisement ,   By   TIM ARANGO  ,     SABRINA TAVERNISE   and     CEYLAN YEGINSU    JUNE 28, 2016 
 ,Credit Ilhas News Agency, via Agence France-Presse — Getty Images,ISTANBUL ......]
[<p>Advertisement ,   By  MILAN SCHREUER  and     ALISSA J. RUBIN    OCT. 5, 2016 
 ,  BRUSSELS — A man wounded two police officers with a knife in Brussels around noon 
on Wednesday in what the authorities called “a potential terrorist attack.” ,  
The two ......]" 

What I would like to do is count the frequency of each word I have in a CSV file (20 words) and write the output like this:

  id, attack, war, terrorism, people, killed, said
  article_1, 45, 5, 4, 6, 2, 1
  article_2, 10, 3, 2, 1, 0, 0

The words in the csv are stored like this:

attack
people
killed
attacks
state
islamic

As suggested, I am first trying to split the whole text file at the tag <p> before starting to count the words. Then I tokenized the resulting list.

This is what I have so far:

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Read the list of target words from the CSV file
opener = open("News_words_most_common.csv")
words = opener.read()
my_pattern = r'\w+'
x = re.findall(my_pattern, words)

# Read the articles, lowercase them, and split on the article tag
file_open = open("Training_News_6.csv")
files = file_open.read()
r = files.lower()
stops = set(stopwords.words("english"))
articles = r.split("<p>")

# word_tokenize expects a string, so tokenize the string form of the list
string = str(articles)
token = word_tokenize(string)
print(token)

This is the output:

['[', "'", "''", '|', '[', "'", ',', "'advertisement", 
',', 'by', 'milan', 'schreuer'.....']', '|', "''", '\\n', "'", ']']

The next step will be looping over the split articles (each one turned into a list of tokenized words) and counting the frequency of the words from the first file. If you have any suggestions on how to iterate and count, please let me know!
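What I have in mind is something along these lines (a rough sketch only, reusing the articles and x names from the snippet above):

from collections import Counter
from nltk.tokenize import word_tokenize

# Rough sketch: tokenize each article separately and count the target words
for i, article in enumerate(articles, start=1):
    counts = Counter(word_tokenize(article))
    print('article_%d,' % i, ','.join(str(counts[word]) for word in x))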

I am using Python 3.5 on Anaconda.

M.Huntz
  • related http://stackoverflow.com/a/14921469/4063051 – glS Nov 15 '16 at 14:23
  • Yes, it is related. I know how to use the Counter module; I already used it to create the list of words. The big deal is counting the frequencies of the words in my list for each article contained in my single text file. – M.Huntz Nov 15 '16 at 14:33

4 Answers


You could try reading your text file, then splitting at the '<p>' tag (which, as you say, marks the beginning of each new article); that gives you a list of articles. A simple loop with count will do.
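A minimal sketch of that idea, assuming the filenames and word list from the question:

# Split the file into articles, then count each keyword with str.count
key_words = ['attack', 'people', 'killed', 'attacks', 'state', 'islamic']

text = open('Training_News_6.csv').read().lower()
articles = text.split('<p>')

for i, article in enumerate(articles, start=1):
    counts = [article.count(word) for word in key_words]
    print('article_%d,' % i, ','.join(str(c) for c in counts))

Note that str.count matches substrings, so 'attack' is also counted inside 'attacks'; tokenizing first (as in the question) avoids that.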

I would recommend you take a look at the nltk module. I am not sure what your end goal is, but nltk has easy-to-implement functions for this sort of thing and much more (for example, instead of just looking at the number of times a word appears in each article, you could calculate its frequency, and even scale it by inverse document frequency, known as tf-idf).
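As a rough illustration of the tf-idf idea (shown here with scikit-learn's TfidfVectorizer rather than nltk; the texts are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf down-weights words that occur in many articles,
# so article-specific terms stand out
articles = ['police said the attack killed two people',
            'officials said the state responded to the attacks']  # placeholders
vectorizer = TfidfVectorizer(vocabulary=['attack', 'people', 'killed'])
weights = vectorizer.fit_transform(articles)
print(weights.toarray())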

tomasn4a
  • I edited my question with your suggestion. Yes, I already used the nltk tf function for the first part of the task, but I didn't use tf-idf for the aforementioned problem (splitting the text into different articles). However, I don't know if I am using the split function correctly. – M.Huntz Nov 15 '16 at 16:10

You can try to use pandas and sklearn:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load the vocabulary (one word per line) and split the text into articles
vocabulary = [word.strip() for word in open('vocabulary.txt').readlines()]
corpus = open('articles.txt').read().split('<p>Advertisement')

# Count only the vocabulary words in each article
vectorizer = CountVectorizer(min_df=1, vocabulary=vocabulary)
words_matrix = vectorizer.fit_transform(corpus)
df = pd.DataFrame(data=words_matrix.todense(),
                  index=('article_%s' % i for i in range(words_matrix.shape[0])),
                  columns=vectorizer.get_feature_names())
df.index.name = 'id'
df.to_csv('articles.csv')

In file articles.csv:

$ cat articles.csv
id,attack,people,killed,attacks,state,islamic
article_0,0,0,0,0,0,0
article_1,0,0,0,0,0,0
article_2,1,0,0,0,0,0
Eugene Lisitsky

Perhaps I didn't understand the task fully...

If you are doing text categorisation, it could be handy to use the standard scikit vectorizers, for example Bag of Words, which take a text and return an array of word counts. You can use the result directly in classifiers, or write it to CSV if you really need CSV. It is already included in scikit-learn and Anaconda.

Another way is to split manually. You can load the data, split it into words, count them, exclude stopwords, and write the result to an output file. Like:

    import re
    from collections import Counter
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))
    txt = open('file.txt', 'r').read()
    # Keep only alphabetic words, case-insensitively
    words = re.findall('[a-z]+', txt, re.I)
    # Count every word that is not a stopword
    cnt = Counter(w for w in words if w.lower() not in stop_words)
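To get the counts per article rather than for the whole file (as the question asks), the same idea can be applied after splitting on the article tag; a rough sketch, reusing the names above:

    # Rough sketch: repeat the counting separately for each article
    key_words = ['attack', 'people', 'killed', 'attacks', 'state', 'islamic']
    articles = open('file.txt', 'r').read().split('<p>Advertisement')
    for i, article in enumerate(articles, start=1):
        words = re.findall('[a-z]+', article, re.I)
        cnt = Counter(w.lower() for w in words)
        print('article_%d,' % i, ','.join(str(cnt[k]) for k in key_words))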
Eugene Lisitsky
  • First of all, thanks for the help. Actually, the task is already defined: I already have the most frequent words (across the whole document) in another CSV file. What I have to do now is count the frequency of those words (20 words) in each article, keeping in mind that all the articles are stored in a single file. – M.Huntz Nov 18 '16 at 18:27
  • The final output should look like Article_1, 3, 4, 45, 32, etc., where the numbers indicate the frequency of the words (from the CSV file) in each article. – M.Huntz Nov 18 '16 at 18:28

How about this:

import re
from collections import Counter

csv_data = [["'", "\\n", ","], ['fox'],
            ['the', 'fox', 'jumped'],
            ['over', 'the', 'fence'],
            ['fox'], ['fence']]
key_words = ['over', 'fox']
words_list = []

# Flatten the nested rows, keeping only the alphabetic part of each item
for i in csv_data:
    for j in i:
        line_of_words = ",".join(re.findall("[a-zA-Z]+", j))
        words_list.append(line_of_words)
word_count = Counter(words_list)

# Keep only the counts of the key words
match_dict = {}
for aword, freq in word_count.items():
    if aword in key_words:
        match_dict[aword] = freq

Which results in:

print('Article words: ', words_list)
print('Article Word Count: ', word_count)
print('Matches: ', match_dict)

Article words:  ['', 'n', '', 'fox', 'the', 'fox', 'jumped', 'over', 'the', 'fence', 'fox', 'fence']
Article Word Count:  Counter({'fox': 3, '': 2, 'the': 2, 'fence': 2, 'n': 1, 'over': 1, 'jumped': 1})
Matches:  {'over': 1, 'fox': 3}
  • Thanks for the suggestion. The main problem is that the words should be counted in each nested list, so that I get the frequency for the first article, the second, etc. (separately). In your code the frequency is counted across all the articles at the same time. – M.Huntz Nov 19 '16 at 11:04
  • Since the articles are split by "<p>", you can first loop over the nested csv data while adding all elements to a list until you encounter a "<p>", in which case you start a new list and begin adding all elements to that, and so on. Then you can run the approach above on each list. – Nov 19 '16 at 23:01
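A rough sketch of that grouping idea (hypothetical, assuming "<p>" marks the start of each article in the nested data):

# Group flattened rows into per-article lists at each '<p>' marker
articles, current = [], []
for row in csv_data:
    for item in row:
        if '<p>' in item:      # a new article starts here
            if current:
                articles.append(current)
            current = []
        else:
            current.append(item)
if current:
    articles.append(current)
# Now run the counting approach above on each list in articles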