3

I have data that looks like this:

            movie_id    comment
            1           tom cruise is great
            1           great action movie
            2           got teary eyed
            2           great cast
            1           tom cruise is hott

I'd like a function that returns the most common words in the comments, based on what movie_id I select. So if I'm querying movie_id=1, I'd get:

            tom, 2
            cruise, 2
            is, 2
            great, 2
            hott, 1
            action, 1
            movie, 1

While if I query movie_id=2, I'd get:

            got, 1
            teary, 1
            eyed, 1
            great, 1
            cast, 1

I saw some solutions using tsql, but I've never used that before and didn't understand the code. Looking for a way to do this in sqlite3.

user1956609
  • 2,132
  • 5
  • 27
  • 43
  • There is no `Split` function or equivalent in SQLite, so this can't be done in a simple query, you'll likely need to write some code. Maybe do some work with the Subtr() function – Scotch Apr 04 '13 at 19:59

2 Answers2

4

You can do this with a really ugly query.

select word, count(*) from (
select (case when instr(substr(m.comments, nums.n+1), ' ') then substr(m.comments, nums.n+1)
             else substr(m.comments, nums.n+1, instr(substr(m.comments, nums.n+1), ' ') - 1)
        end) as word
from (select ' '||comments as comments
      from m
     )m cross join
     (select 1 as n union all select 2 union all select 3
     ) nums
where substr(m.comments, nums.n, 1) = ' ' and substr(m.comments, nums.n, 1) <> ' '
) w
group by word
order by count(*) desc

This is untested. The inner query needs a list of numbers (limited to just 3 here; you can see how to add more). It then checks to see if a word starts at position n+1. A word starts after a space, so I put a space at the beginning of the comments.

Then it pulls the word out, for aggregation purposes.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • M is the movie_id I'm guessing. What is nums? – user1956609 Apr 04 '13 at 20:18
  • @user1956609 . . . Nums is a table that just contains integers starting at 1. I am using it to test offsets into the comments, to determine if a word is starting. In this case it has only 3 numbers. You would want more like 50 or 100. You may already have a numbers table available (say the primary id in another table), and you could use that instead. – Gordon Linoff Apr 04 '13 at 20:20
  • Gah, I've tried instr and charindex in sqlite3 and neither is recognized as a function. Is there another way to get this functionality or do I need to add a custom function? – user1956609 Apr 04 '13 at 21:01
  • @user1956609 . . . It may depend on the version, but it is documented here . . . http://www.sqlite.org/lang_corefunc.html. – Gordon Linoff Apr 04 '13 at 21:02
  • Okay I updated my sqlite version and got it to run without errors... but nothing is returning. I'm new to instr/substr and having a hard time getting intuition for what some of this is doing. Is there by chance a typo? I'm not sure what 'as word' or 'w' are doing. – user1956609 Apr 04 '13 at 22:46
0

This a hack, but you can also use SQLite's Table-valued Functions to solve this:

select value from (
    select
        '["' || replace(trim(title), '"', '') || '"]' as arr
    from story limit 2500000
    ) e, json_each(e.arr)

Watch out though - you might notice that you have a lot of characters in your column that may break JSON parsing and throw a JSON parsing error.

In that case, you might have to remove some control characters:

select value from (select
'["' || replace(replace(replace(replace(replace(replace(trim(title), '"', ''), CHAR(10), ""), CHAR(13), ''), "  ", ""), "\", ""), ' ', '","') 
|| '"]' as arr
from your_table limit 2500000
) e, json_each(e.arr)

To make it easier to debug, you can use a combo of limit and offset (eg. limit 1000000 offset 2200000) to help you find the problem rows and then replace that character with a an empty string.

oxuser
  • 1,257
  • 2
  • 16
  • 23