I have Python code that classifies a piece of news as either fake or real. TfidfVectorizer is used to vectorize the text, and a Passive Aggressive Classifier is used as the fake news detector. Could someone tell me what code I should use to display the 30 most common words in the fake news and in the real news? And how do I draw a bar plot showing the frequency of these words?

%matplotlib inline
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import itertools
import json
import csv
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier  
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

df = pd.read_csv(r".\fake_news(1).csv", sep=',', header=0, engine='python', escapechar='\\')
#print(df)
#df.shape
df.head()
#df.head().to_dict()

headline1 = df.headline
headline1.head()

trainx, testx, trainy, testy = train_test_split(df['headline'], is_sarcastic_1, test_size = 0.2, random_state = 7)

tvector = TfidfVectorizer(strip_accents='ascii', stop_words='english', max_df=0.5)
ttrain = tvector.fit_transform(trainx)
ttest = tvector.transform(testx)

pac = PassiveAggressiveClassifier(max_iter=100)
pac.fit(ttrain, trainy)

y_pred = pac.predict(ttest)
score = accuracy_score(testy, y_pred)
print(f'Accuracy: {round(score*100,2)}%')

corpus = ['dem rep. totally nails why congress is falling short on gender, racial equality',
  'eat your veggies: 9 deliciously different recipes',
'inclement weather prevents liar from getting to work',
"mother comes pretty close to using word 'streaming' correctly"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
John Rambo
  • This should help: https://stackoverflow.com/questions/34232190/scikit-learn-tfidfvectorizer-how-to-get-top-n-terms-with-highest-tf-idf-score – MjH Dec 15 '19 at 00:21
  • Please, show what you have done so far, otherwise we won't be able to help you. Post how you're extracting the tfidf scores and so on. – Tiago Duque Dec 15 '19 at 00:27
  • Tiago, I have now posted the entire code. – John Rambo Dec 15 '19 at 00:40

1 Answer

You need to understand what .fit_transform(corpus) returns. It is a matrix whose rows are the sentences in your corpus and whose columns are the words (features). The values are the Tfidf weights of those words; mind that they are not word counts (read https://en.wikipedia.org/wiki/Tf%E2%80%93idf). So to get a word's Tfidf for the entire corpus, you just need to sum its column.

import numpy as np
from matplotlib import pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()

corpus = ['dem rep. totally nails why congress is falling short on gender, racial equality',
  'eat your veggies: 9 deliciously different recipes',
'inclement weather prevents liar from getting to work',
"mother comes pretty close to using word 'streaming' correctly"]

X = vect.fit_transform(corpus)

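# pair each word with the sum of its Tfidf values over the whole corpus
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
features_rank = list(zip(vect.get_feature_names_out(), np.asarray(X.sum(axis=0)).ravel()))

# sort by summed Tfidf, highest first
features_rank = np.array(sorted(features_rank, key=lambda x: x[1], reverse=True))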

n = 10
plt.figure(figsize=(5, 10))
plt.barh(-np.arange(n), features_rank[:n, 1].astype(float), height=.8)
plt.yticks(ticks=-np.arange(n), labels=features_rank[:n, 0])

[Image: the resulting bar plot]

MjH
  • MjH, thank you, much appreciated ! And yes, you're right, I need to learn what the functions (methods) actually mean. – John Rambo Dec 15 '19 at 03:21