2

I'm doing a project that requires me to sort documents by topic.

For example, I have 4 topics: Lecture, Tutor, Lab and Exam. I have some sentences:

  1. Lecture was engaging
  2. Tutor is very nice and active
  3. The content of lecture was too much for 2 hours.
  4. Exam seem to be too difficult compare with weekly lab.

Now I want to sort these sentences into the topics above. The result should be:

  • Lecture: 2
  • Tutor: 1
  • Exam: 1

I did some research, and the approach I found most often is LDA topic modeling. But it does not seem to solve my problem: as far as I know, LDA discovers the topics in documents by itself, and I don't know how to choose the topics manually in advance.

Could anyone help me, please? I'm stuck on this.

Austin
  • 23
  • 3
  • 1
    Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. [On topic](http://stackoverflow.com/help/on-topic), [how to ask](http://stackoverflow.com/help/how-to-ask), and [... the perfect question](https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) apply here. StackOverflow is an archive of specific programming solutions. Your question seems to want generic guidance for attacking a loosely defined application. – Prune Aug 29 '18 at 22:47
  • The method for doing this could vary greatly depending on the document type. What are we working with? – emsimpson92 Aug 29 '18 at 23:11
  • I'm working with a CSV file and already know how to read the file. – Austin Aug 29 '18 at 23:42
  • Possible dupe of https://stackoverflow.com/questions/3113428/classifying-documents-into-categories. Also, too broad. – alvas Aug 30 '18 at 03:37

4 Answers

6

This is an excellent example of where to use something smarter than string matching =)

Let's consider this:

  • Is there a way to convert each word to a vector form (i.e. an array of floats)?

  • Is there a way to convert each sentence to the same vector form (i.e. an array of floats with the same dimensions as the word's vector form)?


First, let's build a vocabulary of all the words that occur in your list of sentences (let's call it a corpus):

>>> from itertools import chain
>>> from nltk import word_tokenize  # may require downloading NLTK tokenizer data first
>>> s1 = "Lecture was engaging"
>>> s2 = "Tutor is very nice and active"
>>> s3 = "The content of lecture was too much for 2 hours."
>>> s4 = "Exam seem to be too difficult compare with weekly lab."
>>> list(map(word_tokenize, [s1, s2, s3, s4]))
[['Lecture', 'was', 'engaging'], ['Tutor', 'is', 'very', 'nice', 'and', 'active'], ['The', 'content', 'of', 'lecture', 'was', 'too', 'much', 'for', '2', 'hours', '.'], ['Exam', 'seem', 'to', 'be', 'too', 'difficult', 'compare', 'with', 'weekly', 'lab', '.']]
>>> vocab = sorted(set(token.lower() for token in chain(*list(map(word_tokenize, [s1, s2, s3, s4])))))
>>> vocab
['.', '2', 'active', 'and', 'be', 'compare', 'content', 'difficult', 'engaging', 'exam', 'for', 'hours', 'is', 'lab', 'lecture', 'much', 'nice', 'of', 'seem', 'the', 'to', 'too', 'tutor', 'very', 'was', 'weekly', 'with']

Now let's represent the 4 keywords as vectors, putting a 1 at the index of the word in the vocabulary and 0 everywhere else:

>>> lecture = [1 if token == 'lecture' else 0 for token in vocab]
>>> lab = [1 if token == 'lab' else 0 for token in vocab]
>>> tutor = [1 if token == 'tutor' else 0 for token in vocab]
>>> exam = [1 if token == 'exam' else 0 for token in vocab]
>>> lecture
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> lab
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> tutor
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
>>> exam
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Similarly, we go through each sentence and convert it into vector form:

>>> [token.lower() for token in word_tokenize(s1)]
['lecture', 'was', 'engaging']
>>> s1_tokens = [token.lower() for token in word_tokenize(s1)]
>>> s1_vec = [1 if token in s1_tokens else 0  for token in vocab]
>>> s1_vec
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

Repeating the same for all sentences:

>>> s2_tokens = [token.lower() for token in word_tokenize(s2)]
>>> s3_tokens = [token.lower() for token in word_tokenize(s3)]
>>> s4_tokens = [token.lower() for token in word_tokenize(s4)]
>>> s2_vec = [1 if token in s2_tokens else 0  for token in vocab]
>>> s3_vec = [1 if token in s3_tokens else 0  for token in vocab]
>>> s4_vec = [1 if token in s4_tokens else 0  for token in vocab]
>>> s2_vec
[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
>>> s3_vec
[1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
>>> s4_vec
[1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1]

Now, given the vector form of the sentences and the words, you can use similarity functions, e.g. cosine similarity:

>>> from numpy import dot
>>> from numpy.linalg import norm
>>> 
>>> cos_sim = lambda x, y: dot(x,y)/(norm(x)*norm(y))
>>> cos_sim(s1_vec, lecture)
0.5773502691896258
>>> cos_sim(s1_vec, lab)
0.0
>>> cos_sim(s1_vec, exam)
0.0
>>> cos_sim(s1_vec, tutor)
0.0
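
As a worked check of the first score (my own arithmetic from the vectors printed above, not extra output): s1_vec contains three 1s, lecture contains a single 1, and the two vectors share exactly one position.

```latex
\cos(\mathbf{x}, \mathbf{y})
  = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert},
\qquad
\cos(\texttt{s1\_vec}, \texttt{lecture})
  = \frac{1}{\sqrt{3} \cdot \sqrt{1}} \approx 0.5774
```

More generally, with binary vectors like these, a sentence with n distinct tokens that contains the keyword scores 1/sqrt(n), and a sentence without it scores 0.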

Now, doing it more systematically:

>>> topics = {'lecture': lecture, 'lab': lab, 'exam': exam, 'tutor':tutor}
>>> topics
{'lecture': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'lab':     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'exam':    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'tutor':   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]}


>>> sentences = {'s1':s1_vec, 's2':s2_vec, 's3':s3_vec, 's4':s4_vec}

>>> for s_num, s_vec in sentences.items():
...     print(s_num)
...     for name, topic_vec in topics.items():
...         print('\t', name, cos_sim(s_vec, topic_vec))
... 
s1
     lecture 0.5773502691896258
     lab 0.0
     exam 0.0
     tutor 0.0
s2
     lecture 0.0
     lab 0.0
     exam 0.0
     tutor 0.4082482904638631
s3
     lecture 0.30151134457776363
     lab 0.0
     exam 0.0
     tutor 0.0
s4
     lecture 0.0
     lab 0.30151134457776363
     exam 0.30151134457776363
     tutor 0.0
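
To turn these scores into the per-topic counts asked for in the question, one option is to assign each sentence to its highest-scoring topic and tally the results. The following is a minimal sketch, not a definitive implementation. Assumptions of mine: `tokenize` is a simplified regex stand-in for `word_tokenize` (it drops punctuation), the function names are hypothetical, and ties are broken by whichever keyword appears first in the sentence, which is just one possible heuristic.

```python
import math
import re
from collections import Counter

TOPICS = ["lecture", "tutor", "lab", "exam"]

def tokenize(text):
    # Simplified stand-in for word_tokenize: lowercase alphanumeric runs only
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(x, y):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

def count_by_best_topic(sentences, topics=TOPICS):
    # Vocabulary over all sentences, plus the topic keywords themselves
    vocab = sorted({tok for s in sentences for tok in tokenize(s)} | set(topics))
    topic_vecs = {t: [1 if tok == t else 0 for tok in vocab] for t in topics}
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        present = set(tokens)
        s_vec = [1 if tok in present else 0 for tok in vocab]
        scores = {t: cosine(s_vec, vec) for t, vec in topic_vecs.items()}
        top = max(scores.values())
        if top == 0:
            continue  # no topic keyword in this sentence, leave it unassigned
        tied = [t for t in topics if scores[t] == top]
        # Assumed tie-break: the keyword that occurs earliest in the sentence wins
        best = min(tied, key=tokens.index)
        counts[best] += 1
    return counts

sentences = [
    "Lecture was engaging",
    "Tutor is very nice and active",
    "The content of lecture was too much for 2 hours.",
    "Exam seem to be too difficult compare with weekly lab.",
]
for topic, n in count_by_best_topic(sentences).items():
    print(topic.title() + ":", n)
```

With this tie-break, s4 goes to "exam" because "Exam" comes before "lab" in that sentence. A different rule (or a richer vector representation) could reasonably decide otherwise.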

I guess you get the idea. But we see that the scores are still tied for s4-lab vs s4-exam. So the question becomes, "Is there a way to make them diverge?" and you will jump into the rabbit hole of:

  • How best to represent the sentence/word as a fixed-size vector?

  • What similarity function to use to compare "topic"/word vs sentence?

  • What is a "topic"? What does the vector actually represent?

The representation used above is usually called a one-hot vector for a single word, and a binary bag-of-words vector for a whole sentence. There's a lot more complexity to "identifying sentences related to a topic" (a.k.a. document clustering/classification) than simply comparing strings. E.g., can a document/sentence have more than one topic?
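
If a sentence is allowed to have more than one topic, one simple option is to keep every topic whose similarity clears a threshold instead of picking a single winner. A minimal sketch under my own assumptions: the function name, the regex tokenizer and the default threshold of 0.2 are all hypothetical, and the score uses the fact that for a binary sentence vector and a one-hot topic vector the cosine similarity reduces to 1/sqrt(n), where n is the number of distinct tokens in the sentence.

```python
import math
import re

def topics_above_threshold(sentence, topics, threshold=0.2):
    # Distinct lowercase tokens; punctuation is dropped (simplified tokenizer)
    tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    if not tokens:
        return []
    # Cosine similarity between a binary sentence vector and a one-hot
    # topic vector: 1/sqrt(n) if the keyword is present, otherwise 0
    present_score = 1 / math.sqrt(len(tokens))
    return [t for t in topics if t in tokens and present_score >= threshold]

topics = ["lecture", "tutor", "lab", "exam"]
print(topics_above_threshold(
    "Exam seem to be too difficult compare with weekly lab.", topics))
```

Because this tokenizer drops the full stop, the scores differ slightly from the ones printed earlier, and the right threshold depends on how long your sentences are.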

Do look up these keywords to understand the problem further: "natural language processing", "document classification", "machine learning". Meanwhile, if you don't mind, I think this question should be closed as "too broad".

alvas
  • 115,346
  • 109
  • 446
  • 738
  • Shameless plug to help you further with the answer above: https://www.kaggle.com/alvations/basic-nlp-with-nltk and https://drive.google.com/file/d/1lxRclJablHF-veuRzWBgJ9gaqMNo6fPa/view?usp=sharing – alvas Aug 30 '18 at 03:38
  • Thanks, this could be the answer for me – Austin Aug 30 '18 at 07:37
0

I'm assuming you're reading from a text file or something. Here is how I would go about doing this.

keywords = {"lecture": 0, "tutor": 0, "exam": 0}

with open("file.txt", "r") as f:
    for line in f:
        for key in keywords:
            if key in line.lower():
                # Update the dict entry itself; incrementing a loop
                # variable would leave the dictionary unchanged
                keywords[key] += 1

print(keywords)

This searches each line for any word in your keywords dictionary, and if a match is found, it increments the value on that key.

You shouldn't need any external libraries or anything for this.
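
One caveat with plain substring checks is that "exam" also matches inside "example". If that matters for your data, a whole-word match with the standard `re` module is one way around it. This is only a sketch: the function name is hypothetical, and `lines` can be any iterable of strings, such as an open file.

```python
import re

def count_lines_by_keyword(lines, keywords):
    # One compiled whole-word, case-insensitive pattern per keyword
    patterns = {
        key: re.compile(r"\b" + re.escape(key) + r"\b", re.IGNORECASE)
        for key in keywords
    }
    counts = {key: 0 for key in keywords}
    for line in lines:
        for key, pattern in patterns.items():
            if pattern.search(line):
                counts[key] += 1  # at most once per line, as in the code above
    return counts

lines = [
    "Lecture was engaging",
    "An example would have helped",
    "Exam seem to be too difficult compare with weekly lab.",
]
print(count_lines_by_keyword(lines, ["lecture", "tutor", "exam"]))
```

The trade-off is that plurals such as "lectures" no longer match, so you may want to list those variants explicitly.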

emsimpson92
  • 1,779
  • 1
  • 9
  • 24
  • Thank you, this would solve my problem. However, one issue is that if a sentence repeats the word "lecture" twice, it would affect the result. – Austin Aug 29 '18 at 23:33
  • Also, I want to extend my program so that it can analyse whether a sentence is positive or negative. Do you know which library would support that? I already tried using nltk with a scikit classifier, but I can only check it against results I have already labelled. For example, I label 200 sentences that are a mix of positive and negative, let the classifier learn from them, and then check what percentage of another 50 labelled comments it gets right. – Austin Aug 29 '18 at 23:41
  • You could always split the sentence further and evaluate each word if you want to find the same keyword twice in a sentence. I'm not sure about libraries... maybe you could check out the software recommendations stack exchange site? – emsimpson92 Aug 31 '18 at 17:10
0

Solution

filename = "information.txt"

library = {"lecture": 0, "tutor": 0, "exam": 0}

with open(filename) as f_obj:
    content = f_obj.read()  # read the whole file into one string

words = content.lower().split()  # list of all whitespace-separated words

for k in library:
    for word in words:
        if k in word:  # substring match, so "lectures" also counts
            library[k] += 1  # update the dict entry directly

for k, v in library.items():
    print(k.title() + ": " + str(v))

Output

(xenial)vash@localhost:~/pcc/12/alien_invasion_2$ python3 helping_topic.py 
Tutor: 1
Lecture: 2
Exam: 1
(xenial)vash@localhost:~/pcc/12/alien_invasion_2$

This method will count duplicates for you
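
If you want the same per-occurrence counts but with exact word matches rather than substring matches, `collections.Counter` from the standard library is a compact alternative. A minimal sketch; the function name and the regex tokenizer are my own assumptions, not part of the code above.

```python
import re
from collections import Counter

def count_keyword_occurrences(text, keywords):
    # Lowercase alphanumeric tokens, so "lecture," and "Lecture!" both become "lecture"
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    token_counts = Counter(tokens)
    # A Counter returns 0 for tokens that never occurred
    return {key: token_counts[key] for key in keywords}

text = "Lecture was engaging. The lecture, again a lecture!"
print(count_keyword_occurrences(text, ["lecture", "tutor", "exam"]))
```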

Enjoy!

vash_the_stampede
  • 4,590
  • 1
  • 8
  • 20
-2

Just name the variables after the topics you want

lecture = 2
tutor = 1
exam = 1

You can use variable_name += 1 to increment the variable

Frogmonkey
  • 45
  • 5
  • 1
    This doesn't answer the question. How do they identify if a sentence contains the term? Also it would be better to use a `dict` for this imo – emsimpson92 Aug 29 '18 at 22:49
  • Sorry, I thought that when you sort a document in python the text was read as a string – Frogmonkey Aug 29 '18 at 22:51
  • @Frogmonkey yes, but you didn't explain how to read the file or how to check if each word exists in the sentence. You simply told them how to increment variables, which I'm sure they already knew – emsimpson92 Aug 29 '18 at 23:03