-1

I have a text column which contains hashtags used by users. each row contains a different number of hashtags, separed by a space, like this:

USERS |  HASHTAG COLUMN:
------------------------
user1 | hashtag1 hashtag2
user2 | hashtag2
user1 | hashtag1 hashtag2 hashtag3 hashtag4

I want to get the most used hashtags, in this case my desired output should be:

OCCURRENCES | TAG
----------------------
3           | hashtag2
2           | hashtag1
1           | hashtag3
1           | hashtag4

I have NO IDEA how to get it, any help is much appreciated. Thank you

silk
  • 129
  • 7
  • 3
    You're in for a world of pain: To do what you want with the current design, you'll have to write a parser for that field. You'll need a slight database redesign to get past this and other future problems, because your table is not in first normal form. Look here: https://en.wikipedia.org/wiki/First_normal_form – Jeremy Smyth Dec 19 '13 at 18:04
  • take a look at this http://stackoverflow.com/questions/1096679/can-mysql-split-a-column – Tin Tran Dec 19 '13 at 18:42

1 Answers1

1

Assuming you can't redesign your database to be in 1NF, then you can do this in bash:

echo "select hashtag from table" | \
mysql --user=foo --password=bar --host=hostname --database=dbname --skip-column-names | \
sed -e 's/ /\n/g' | \
sort | \
uniq -c | \
sort -rn

The sed command puts each hashtag on its own line. The first sort command puts all the duplicate hashtags next to each other so that ... the uniq command can count all the occurrences of each one. The second sort command orders the output in reverse numerical order by the counts.

dg99
  • 5,456
  • 3
  • 37
  • 49