0

I have a database called ‘tweets’. The database 'tweets' includes (amongst others) the rows 'tweet_id', 'created at' (dd/mm/yyyy hh/mm/ss), ‘classified’ and 'processed text'. Within the ‘processed text’ row there are certain strings such as {TICKER|IBM}', to which I will refer as ticker-strings.

My target is to get the average value of ‘classified’ per ticker-string per day. The row ‘classified’ includes the numerical values -1, 0 and 1. At this moment, I have a working SQL query for the average value of ‘classified’ for one ticker-string per day. See the script below.

    SELECT Date( `created_at` ) , AVG( `classified` ) AS Classified
    FROM `tweets` 
    WHERE `processed_text` LIKE '%{TICKER|IBM}%'
    GROUP BY Date( `created_at` )

There are however two problems with this script:

  1. It does not include days on which there were zero ‘processed_text’s like {TICKER|IBM}. I would however like it to spit out the value zero in this case.
  2. I have 100+ different ticker-strings and would thus like to have a script which can process multiple strings at the same time. I can also do them manually, one by one, but this would cost me a terrible lot of time.

When I had a similar question for counting the ‘tweet_id’s per ticker-string, somebody else suggested using the following:

SELECT d.date, coalesce(IBM, 0) as IBM, coalesce(GOOG, 0) as GOOG,
coalesce(BAC, 0) AS BAC
FROM dates d LEFT JOIN
 (SELECT DATE(created_at) AS date,
         COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|IBM}%' then tweet_id
               END) as IBM,
         COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' then tweet_id
               END) as GOOG,
         COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|BAC}%' then tweet_id
               END) as BAC
  FROM tweets
  GROUP BY date
 ) t
 ON d.date = t.date;

This script worked perfectly for counting the tweet_ids per ticker-string. As I however stated, I am not looking to find the average classified scores per ticker-string. My question is therefore: Could someone show me how to adjust this script in such a way that I can calculate the average classified scores per ticker-string per day?

Geoffrey
  • 27
  • 1
  • 6

1 Answers1

0
SELECT d.date, t.ticker, COALESCE(COUNT(DISTINCT tweet_id), 0) AS tweets
FROM dates d
LEFT JOIN
    (SELECT DATE(created_at) AS date,
            SUBSTR(processed_text, 
                   LOCATE('{TICKER|', processed_text) + 8,
                   LOCATE('}', processed_text, LOCATE('{TICKER|', processed_text))
                    - LOCATE('{TICKER|', processed_text) - 8)) t
ON d.date = t.date
GROUP BY d.date, t.ticker

This will put each ticker on its own row, not a column. If you want them moved to columns, you have to pivot the result. How you do this depends on the DBMS. Some have built-in features for creating pivot tables. Others (e.g. MySQL) do not and you have to write tricky code to do it; if you know all the possible values ahead of time, it's not too hard, but if they can change you have to write dynamic SQL in a stored procedure.

See MySQL pivot table for how to do it in MySQL.

Community
  • 1
  • 1
Barmar
  • 741,623
  • 53
  • 500
  • 612