I'm one of those later people :)
So my understanding of TF-IDF is that the IDF is computed from how many documents a word (or n-gram) appears in? Vectorizing each document separately and comparing what matches doesn't really capture how common a word is across both documents, which is what you'd need for weeding out common words. Is there a way to do that with n-grams without hitting the index error below?
ValueError: Shape of passed values is (26736, 1), indices imply (60916, 1)
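For reference, here's my rough understanding of the smoothed IDF that scikit-learn computes with `smooth_idf=True`, as a minimal sketch on a toy corpus (the bigrams are made up purely for illustration):

```
import math

# Two toy "documents", already split into bigrams (made-up values)
docs = [
    {"machine learning", "deep learning"},
    {"machine learning", "graph theory"},
]
n = len(docs)  # number of documents

def smoothed_idf(term):
    # df = number of documents containing the term
    df = sum(term in doc for doc in docs)
    # scikit-learn's smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

print(smoothed_idf("machine learning"))  # in both docs -> weight 1.0 (common)
print(smoothed_idf("graph theory"))      # in one doc   -> ~1.405 (rarer)
```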
```
# Applying TF-IDF to bigram vectors
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer

# instantiate TfidfVectorizers
ngram_vectorizer1 = TfidfVectorizer(ngram_range=(2, 2))  # bigrams, 1st document
ngram_vectorizer2 = TfidfVectorizer(ngram_range=(2, 2))  # bigrams, 2nd document
ngram_vectorizert = TfidfVectorizer(ngram_range=(2, 2))  # bigrams, combined corpus
# fit and transform each corpus (each vectorizer learns its own vocabulary)
ngram_vector1 = ngram_vectorizer1.fit_transform(text)
ngram_vector2 = ngram_vectorizer2.fit_transform(text2)
ngram_vectort = ngram_vectorizert.fit_transform(total)
# save feature names (get_feature_names() was removed in newer scikit-learn)
ngramfeatures1 = ngram_vectorizer1.get_feature_names_out()
ngramfeatures2 = ngram_vectorizer2.get_feature_names_out()
ngramfeaturest = ngram_vectorizert.get_feature_names_out()
print("\n\nngramfeatures1 : \n", ngramfeatures1)
print("\n\nngramfeatures2 : \n", ngramfeatures2)
print("\n\nngram_vector1 : \n", ngram_vector1.toarray())
print("\n\nngram_vector2 : \n", ngram_vector2.toarray())
# Compute the IDF values
# (Note: TfidfVectorizer already applies IDF internally, so these transformers
# are refitting IDF on tf-idf output rather than on raw counts; the vectorizers'
# own .idf_ attributes already hold the IDF weights.)
first_tfidf_transformer_ngram = TfidfTransformer(smooth_idf=True, use_idf=True)
second_tfidf_transformer_ngram = TfidfTransformer(smooth_idf=True, use_idf=True)
total_tfidf_transformer_ngram = TfidfTransformer(smooth_idf=True, use_idf=True)
first_tfidf_transformer_ngram.fit(ngram_vector1)
second_tfidf_transformer_ngram.fit(ngram_vector2)
total_tfidf_transformer_ngram.fit(ngram_vectort)
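# Sanity check: since TfidfVectorizer already computes IDF internally, its
# idf_ should match the transformer's (IDF depends only on which entries are
# nonzero, and the tf-idf output has the same nonzero pattern as the counts)
import numpy as np
assert np.allclose(first_tfidf_transformer_ngram.idf_, ngram_vectorizer1.idf_)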
# print 1st set of IDF values
ngram_first_idf = pd.DataFrame(first_tfidf_transformer_ngram.idf_,
                               index=ngram_vectorizer1.get_feature_names_out(),
                               columns=["idf_weights"])
# sort ascending
ngram_first_idf.sort_values(by=['idf_weights'])  # this one should really be using something from the "total" calculations if I'm understanding it correctly?
```
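And here's a sketch of what I'm guessing the fix looks like: fit one vectorizer on the combined corpus so both documents share a single vocabulary. This assumes `text` and `text2` are lists of document strings and that `total` is just their concatenation (if `total` is already the combined corpus, fitting on it directly should be equivalent):

```
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# One vectorizer fitted on the combined corpus -> one shared vocabulary,
# so feature indices line up and there's no shape mismatch
shared_vectorizer = TfidfVectorizer(ngram_range=(2, 2))
shared_vectorizer.fit(text + text2)  # assumes text/text2 are lists of strings

# Transform each document set with the SAME fitted vocabulary
vector1 = shared_vectorizer.transform(text)
vector2 = shared_vectorizer.transform(text2)

# idf_ is computed over the combined corpus, so bigrams common to both
# documents get low weights and rare ones get high weights
shared_idf = pd.DataFrame(shared_vectorizer.idf_,
                          index=shared_vectorizer.get_feature_names_out(),
                          columns=["idf_weights"])
print(shared_idf.sort_values(by=["idf_weights"]))  # most common bigrams first
```

If that's right, then `vector1` and `vector2` have the same number of columns as `idf_` has rows, which I think is exactly what the DataFrame indexing was complaining about in the ValueError above.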