I have a pandas dataframe which consists of two strings and one keyword per entry. It looks like this:

    \n  05 Temmuz 2016 17:59                                  \
    0  Suriyelilere vatandaşlığa neden karşı çıkılıyor                                           
    1  Selin Girit Kendi ülkesinde savaştan kaçacak s...                                           

    \n 10 Temmuz 2016 09:01                                  \
    0  Öteki Suriyeliler: Türkiye vatandaşı olursak a...                                           
    1  Cumhurbaşkanı Tayyip Erdoğan Suriyelilere vata...                                           

What I'm trying to do is use scikit-learn to get the tf-idf of each word in the second string and compare it against a corpus of general words, but I'm not really sure how to do that. If I use TfidfVectorizer() I end up with something that looks like this:

    (0, 1)  0.520040083208
    (0, 8)  0.307144050546
    (0, 5)  0.307144050546
    (0, 4)  0.520040083208
    (0, 7)  0.520040083208
    (1, 8)  0.326309521953
    (1, 5)  0.326309521953
    (1, 3)  0.420182921489
    (1, 2)  0.552490047084
    (1, 0)  0.552490047084
    (2, 8)  0.294893556078
    (2, 5)  0.294893556078
    (2, 3)  0.759458290886
    (2, 6)  0.499298193039

But this output doesn't give a score for each word individually, and it looks like a comparison between words in the dictionary rather than against a general corpus. I'm not sure how to do what I'm looking for, and I was hoping someone might have some advice, as the scikit-learn documentation isn't very clear.

  • Can you format your dataframes a little better? It's difficult to interpret what is an index, what is a column header, and what is data. – Grr Apr 19 '17 at 20:16
  • @Grr Yes, sorry! I think that's partially what I'm confused about. I've never used pandas before, so the dataframe format itself still looks odd to me. I've formatted it more clearly, so you can now see there are three lines per entry: the first is the key (in this case a date), and the other two are strings. I'm interested in the text of the second string, which in this case is the content of a newspaper article. – user3102350 Apr 20 '17 at 00:26
  • That output is not a comparison between words. It's the sparse representation of the array: only the elements with non-zero values are displayed, with (i, j) being the row and column of each element. See my [other answer for details](http://stackoverflow.com/questions/43154039/how-to-calculate-term-document-matrix/43154534#43154534) on that. The words for which it is calculated can be returned using tfidf.get_feature_names(). – Vivek Kumar Apr 20 '17 at 02:01
  • Other than that, I'm unclear as to what you want to do. Please describe the input and required output a bit more. – Vivek Kumar Apr 20 '17 at 02:02
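
To make the point about the sparse output concrete, here is a minimal sketch of one way to read the (row, column) pairs back as per-word scores, while taking the IDF weights from a separate general corpus. The `general_corpus` and `articles` lists are made-up English placeholders, not the data from the question, and fitting on one corpus while transforming another is only one possible reading of what is being asked. Note that recent scikit-learn versions expose `get_feature_names_out()` in place of `get_feature_names()`, so the sketch tries both.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for a "general" reference corpus (not from the question).
general_corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
]

# Hypothetical article texts standing in for the second string of each entry.
articles = ["the cat chased a bird"]

vectorizer = TfidfVectorizer()
vectorizer.fit(general_corpus)          # vocabulary and IDF come from the general corpus
tfidf = vectorizer.transform(articles)  # sparse matrix: one row per article

# Column index -> word. Newer scikit-learn uses get_feature_names_out().
get_names = (getattr(vectorizer, "get_feature_names_out", None)
             or vectorizer.get_feature_names)
words = list(get_names())

# Walk the non-zero (row, column, value) triples, i.e. the same entries
# that printing the sparse matrix shows, and group them by article.
coo = tfidf.tocoo()
scores = {}
for i, j, value in zip(coo.row, coo.col, coo.data):
    scores.setdefault(int(i), {})[words[j]] = float(value)

# Print each article's words from highest to lowest tf-idf.
for row, word_scores in scores.items():
    for word, score in sorted(word_scores.items(), key=lambda kv: -kv[1]):
        print(row, word, round(score, 3))
```

With this setup, a word that occurs in many documents of the general corpus gets a lower weight than one that is rare there. Words in an article that never occur in the general corpus are not in the fitted vocabulary, so `transform` ignores them; whether that is acceptable depends on what the comparison is meant to show.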
