I would like to count the occurrences of a list of words for every article contained in a single text file. Each article can be identified because they all start with the common tag "<p>Advertisement".
This is a sample of the text file:
"[<p>Advertisement , By TIM ARANGO , SABRINA TAVERNISE and CEYLAN YEGINSU JUNE 28, 2016
,Credit Ilhas News Agency, via Agence France-Presse — Getty Images,ISTANBUL ......]
[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded two police officers with a knife in Brussels around noon
on Wednesday in what the authorities called “a potential terrorist attack.” ,
The two ......]"
What I would like to do is count the frequency of each of the words I have in a CSV file (20 words) and write the output like this:
id, attack, war, terrorism, people, killed, said
article_1, 45, 5, 4, 6, 2, 1
article_2, 10, 3, 2, 1, 0, 0
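For reference, a table in that shape can be written with Python's built-in csv module; the sketch below just reproduces the sample rows above, and the output filename word_counts.csv is invented:
import csv
# Placeholder sketch: the counts are the sample numbers above and
# "word_counts.csv" is an invented output filename.
with open("word_counts.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id", "attack", "war", "terrorism", "people", "killed", "said"])
    writer.writerow(["article_1", 45, 5, 4, 6, 2, 1])
    writer.writerow(["article_2", 10, 3, 2, 1, 0, 0])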
The words in the CSV are stored like this:
attack
people
killed
attacks
state
islamic
As suggested, I am first trying to split the whole text file on the <p> tag before counting the words, and then tokenize the resulting list.
This is what I have so far:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Read the 20 target words (one per line) from the CSV
with open("News_words_most_common.csv") as opener:
    words = opener.read()
my_pattern = r'\w+'
x = re.findall(my_pattern, words)

# Read the articles, lowercase them, and split on the "<p>" tag
with open("Training_News_6.csv") as file_open:
    files = file_open.read()
r = files.lower()
stops = set(stopwords.words("english"))  # stopword set, not used yet
articles = r.split("<p>")

# word_tokenize() expects a string, not a list, so convert first
string = str(articles)
token = word_tokenize(string)
print(token)
This is the output:
['[', "'", "''", '|', '[', "'", ',', "'advertisement",
',', 'by', 'milan', 'schreuer'.....']', '|', "''", '\\n', "'", ']']
The next step will be to loop over the split articles (now a list of tokenized words) and count the frequency of the words from the first file. If you have any suggestions on how to iterate and count, please let me know!
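To make the question concrete, here is a rough sketch of the loop I have in mind, assuming articles is the list returned by split("<p>") above and x is the list of target words read from the first file (the output filename is again invented); I am not sure this is the right way to iterate:
from collections import Counter
from nltk.tokenize import word_tokenize
import csv
# Rough sketch: `articles` is the list from split("<p>") above and
# `x` is the list of 20 target words; "word_counts.csv" is a made-up name.
with open("word_counts.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id"] + x)  # header row: id plus the target words
    for i, article in enumerate(articles[1:], start=1):  # skip text before the first tag
        counts = Counter(word_tokenize(article))  # tokenize each article on its own
        writer.writerow(["article_%d" % i] + [counts[w] for w in x])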
I am using Python 3.5 on Anaconda.