0

I have a database including certain strings, such as '{TICKER|IBM}' to which I will refer as ticker-strings. My target is to count the amount of ticker-strings per day for multiple strings.

My database table 'tweets' includes the rows 'tweet_id', 'created at' (dd/mm/yyyy hh/mm/ss) and 'processed text'. The ticker-strings, such as '{TICKER|IBM}', are within the 'processed text' row.

At this moment, I have a working SQL query for counting one ticker-string (thanks to the help of other Stackoverflow-ers). What I would like to have is a SQL query in which I can count multiple strings (next to '{TICKER|IBM}' also '{TICKER|GOOG}' and '{TICKER|BAC}' for instance).

The working SQL query for counting one ticker-string is as follows:

SELECT d.date, IFNULL(t.count, 0) AS tweet_count
FROM all_dates AS d
LEFT JOIN (
    SELECT COUNT(DISTINCT tweet_id) AS count, DATE(created_at) AS date
    FROM tweets
    WHERE processed_text LIKE '%{TICKER|IBM}%'
    GROUP BY date) AS t
ON d.date = t.date

The eventual output should thus give a column with the date, a column with {TICKER|IBM}, a column with {TICKER|GOOG} and one with {TICKER|BAC}.

I was wondering whether this is possible and whether you have a solution for this? I have more than 100 different ticker-strings. Of course, doing them one-by-one is an option, but it is a very time-consuming one.

Geoffrey
  • 27
  • 1
  • 6
  • The term you are looking for is 'pivot' (turning several rows in a few columns into one row with many columns). You should be able to find plenty of uses of it on Stackoverflow with a search or two – Twelfth Jun 30 '14 at 20:14
  • It is a little complicated, but I believe to generalize this query for all ticker symbols you need to create a pivot table query. Also, using wildcards, ex. '%{TICKER|IBM}%' will force a serial read on table tweets and impair performance. In this case I think you can use LEFT(processed_text, Length('{TICKER|')) = '{TICKER|' to find tickers. Her's an SO example of dynamic pivot tables in mysql: http://stackoverflow.com/questions/12598120/mysql-pivot-table-query-with-dynamic-columns – ron tornambe Jun 30 '14 at 20:39

2 Answers2

1

If I understand correctly, you can do this with conditional aggregation:

SELECT d.date, coalesce(IBM, 0) as IBM, coalesce(GOOG, 0) as GOOG, coalesce(BAC, 0) AS BAC
FROM all_dates d LEFT JOIN
     (SELECT DATE(created_at) AS date,
             COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|IBM}%' then tweet_id
                   END) as IBM,
             COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' then tweet_id
                   END) as GOOG,
             COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|BAC}%' then tweet_id
                   END) as BAC
      FROM tweets
      GROUP BY date
     ) t
     ON d.date = t.date;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks Gordon, this works perfectly! Typing in the names of the companies in the script costs way less time than doing all 100+ ticker-strings manually, so I really appreciate it!! – Geoffrey Jun 30 '14 at 20:57
0

I'd return the specified resultset like this, adding expressions to the SELECT list for each "ticker" I want returned as a separate column:

   SELECT d.date
        , IFNULL(SUM(t.processed_text LIKE '%{TICKER|IBM}%' ),0) AS `cnt_ibm`
        , IFNULL(SUM(t.processed_text LIKE '%{TICKER|GOOG}%'),0) AS `cnt_goog`
        , IFNULL(SUM(t.processed_text LIKE '%{TICKER|BAC}%' ),0) AS `cnt_goog`
        , IFNULL(SUM(t.processed_text LIKE '%{TICKER|...}%' ),0) AS `cnt_...`
     FROM all_dates d
     LEFT
     JOIN tweets t
       ON t.created_at >= d.date
      AND t.created_at < d.date + INTERVAL 1 DAY
    GROUP BY d.date

NOTES: The expressions within the SUM aggregates above are evaluated as booleans, so they return 1 (if true), 0 (if false), or NULL. I'd avoid wrapping the created_at column in a DATE() function, and use a range scan instead, especially if a predicate is added (WHERE clause) that restricts the values ofdatebeing returned fromall_dates`.

As an alternative, expressions like this will return an equivalent result:

     , SUM(IF(t.process_text LIKE '%{TICKER|IBM}%' ,1,0)) AS `cnt_ibm`
spencer7593
  • 106,611
  • 15
  • 112
  • 140