2

Let's assume we have a database like:

Actions_tbl:

--------------------------------------------------------
id | Action_name                              | user_id|
--------------------------------------------------------
1  |  John reads one book                     | 1     
2  |  reading the book by john                | 1
3  |  Joe is jumping over fire                | 2
4  |  reading another book                    | 2
5  |  John reads the book in library          | 1
6  |  Joe read a    book                      | 2
7  |  read a book                             | 3
8  |  jumping with no reason is Ronald's habit| 3 

Users_tbl:

-----------------------
user_id |    user_name |
-----------------------
1       |     John
2       |     Joe
3       |     Ronald
4       |     Araz
-----------------------

Wondering if I can choose the most repeated similar action regardless of it's user and replace my own user_name with its current user!

Read one book, reading the book, reading another book, read the book in library, read a book and read a book are the ones who have most common WORDS so the staffs related to reading the book is repeated 6 times, my system should show one of those six sentences randomly and replace Araz with user_name

Like: Araz reads the book

My Idea was to

select replace(a.action_name , b.user_name) from actions_tbl a, user_tble b where a.user_id = b.user_id group_by

and then check the similarities one by one in php using

levenshtein()

But this one doesn't have performance at all!

Assume that I want to do the same thing for a big db and for few different tables. This will destroy my server!!!

Any better IDEA?

in http://www.artfulsoftware.com/infotree/queries.php#552 the levenshtein() function is implemented as a MySQL function but firstly, do u think it has enough performance? and then, how to use it in my case? Maybe a self-join van fix this issue but I'm not that good with sql!

* similar action, are the actions that have more than X% common words


** More information and notes:**

  1. I'm limited to PHP and MySQL.

  2. This is just an example, in my real project the actions are long paragraphs. That's why the performance is a matter. The real scenario is: user inputted the description of its project for several projects, those data may be too similar(users would have the same area of work), I want to fill automatically(base on previous fillings) the description of next project, to save time.

  3. I would appreciate if you can have any pragmatical Solution. I checked the NLP related solutions, although they r interesting, but I don't think many of them can be accurate and can be implemented using PHP.

  4. The output should make sense and be a proper paragraph like all other projects. That's why I was thinking of choosing from previous ones.


Thanks for your intellectual answers, its really appreciated if you could shed some light on the situations

Nil Null
  • 414
  • 5
  • 14

2 Answers2

2

What you are talking about is a text clustering process. You are trying to find similar pieces of text, and arbitrarily choosing one of them. I am not familiar with any database that does this form of text mining.

For what you describe, a pretty basic text mining technique would probably work. Create a term-document matrix with all the words except the user names. Then use singular value decomposition to get the largest singular value and vector (this is the first principal component of the correlation matrix). The similar activities should cluster along this line.

If you have a limited vocabulary and have the terms in a table, you could measure distance between two actions by the proportion of words that overlap. Do you have a list of all words in the actions?

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • thanks for suggestion, this table was just a sample, in fact in reality, I have a table that contains long paragraphs, each paragraph can be more than 10 lines and table would have many of those! Do u think its pragmatical to list all words and do what u r saying? is there any sample in PHP ? – Nil Null Jul 18 '12 at 15:46
  • In addition I'm implementing some sort of auto filling , so system can fill the forms base on the data that user provided before. – Nil Null Jul 18 '12 at 15:53
  • You have a more complicated problem than is usually solved by databases directly. You need an application. If you are using SAS, you might look at SAS Text Miner, for instance. If you have the list of forms, and want to find the one closest to each paragraph, you can probably do that with a bag-of-words approach. In any case, you have a problem that relational databases were not designed to solve (although they can be part of the solution). – Gordon Linoff Jul 18 '12 at 15:57
  • The problem is that I'm limited to MySQL and PHP :( – Nil Null Jul 19 '12 at 03:10
1

First off, you'll have to decide whether you want to compare a given input to all existing texts, or do a pairwise comparison of all texts. Your question asks for the latter, but the application you outline sounds more like the former.

If you compare only a single input with your database, I then I'd have hoped levenshtein distance computation to be fast enough up to medium database sizes. And there probably will be few ways to make things any faster unless you store some form of intermediate data structure about the current content of your text base. Recomputing anything for every new input will probably be just as costly.

If you want to do a comparison for every pair, then a levenshtein computation for each of them will take too much time. You'll have to devise some other concept of similarity. The first thing that comes to my mind, which would be somewhat resilient to different forms of a word, would be a suffix tree. You could insert all paragraphs into that tree. Where suffix trees normally store a single pointer, you might want to store a pair of indices, one identifying the database row and the other denoting a position in the text of that row. After building the tree, you could traverse it to identify common substrings, and increment some similarity counter for the corresponding pair. You'll have to experiment a bit to tune this measure. You might want to impose a minimum length for a common string before you increment a counter. As long texts have a larger chance of common words even if they are semantically unrelated, you might have to compensate for length in some way. I doubt there is a canonical way to do this.

The term-document matrix approach suggested by Gordon sounds interesting as well, and you should be able to implement that in PHP, too. That approach will be mor sensitive to changes of word form, even if the root is the same. On the other hand, it might be easier to keep a suitable matrix for that stored in your database, and to keep that structure in sync when you update your main text table. Both of these approaches have a fundamental difference to levenshtein distance: they care less about the overall order. I belive that this is a good thing in your case, because they'll consider the texts “John read a book after he went swimming in the lake” more similar to “After swimming in the lake, Joe read a book” than levenshtein distance would.

Your example indicates that you not only want to rank similarities, but also decide on cluser boundaries, I.e. say “these form a group” and “those belong to distinct groups”. There won't be a clean cut-off for this, so you'll have to experiment with heuristics for that as well. Unless always chosing the most similar text, or the k most similar texts, is enough for your application. In any case, I'd concentrate on the similarity computation first, and add things like user name replacement later on.

Community
  • 1
  • 1
MvG
  • 57,380
  • 22
  • 148
  • 276
  • Thanks for your comment, yes I want to compare all rows in my table to find most common ones and then choose one of them as output. – Nil Null Jul 26 '12 at 09:46
  • here, a proper ranking is not needed (unlike the http://stackoverflow.com/questions/11609348/advance-query-rank-most-related-fields-in-mysql) We may can apply the same thing like what u mentioned there. A query to check all rows (by left joining) and find similar rows by running a procedure (if rows are similar it returns true). The matter is how to write that procedure? I can't make the dictionary cause the words are not limited and I don't have access to the inserting event (my app is a plug-in) The data is inputted using another application – Nil Null Jul 26 '12 at 10:41
  • About choosing, selecting randomly from the common similar rows are enough. SELECT * FROM `table` ORDER BY RAND() LIMIT 0,1; where table is the most common similar rows. [Read one book| reading the book| reading another book| read the book in library| read a book | read a book] in our example. – Nil Null Jul 26 '12 at 10:54