
Using the file posts.csv in the data folder of this repo (a sample of 10,000 public Facebook posts by members of the US Congress from 2017), solve the following with dplyr:

  • Do not consider posts with zero likes

  • Compute the comment to like ratio (i.e. comments_count / likes_count) for each post and store it in a column clr

  • For each screen_name, compute a normaliser_based_on_even_months = max(clr) - min(clr), i.e. the maximum minus the minimum clr value of posts by that screen_name, however, only taking into account the posts made in even months, i.e. posts made in February, April, June, August, October, December, when computing max(clr) - min(clr) for each screen_name

  • Set all normaliser_based_on_even_months that have a value of zero to NA or delete them

  • Afterwards create a column normalised_clr which stores the clr of all posts from the original data frame (other than those with zero likes, which were deleted in the first step) divided by the normaliser_based_on_even_months of the associated screen name. The only exception is posts from screen names that had a normaliser_based_on_even_months value of zero and were deleted/set to NA before -- for these posts, just set the value in normalised_clr to NA as well or drop the post from the final data frame.

  • In other words, the value of a single post/line i (written by a politician p) in that normalised_clr column can be computed as: normalised_clr_{i,p} = clr_{i} / normaliser_based_on_even_months_{p} for all observations for which there is a non-NA normaliser_based_on_even_months (no need to use a loop for this though, dplyr allows computing it in a vectorised way)

  • Keep only those rows with normalised_clr > 0

  • Arrange the data frame according to normalised_clr in ascending order

  • Print out only screen_name and normalised_clr for the first 10 rows, i.e. the posts with the 10 lowest normalised_clr

I have written an R program for this. However, my task is to convert that into a SINGLE SQLite query. Here's what I've done:

# Load the packages used for the database work
library(DBI)
library(RSQLite)

# Create database
posts_db <- dbConnect(RSQLite::SQLite(), "C:/Users/HP/Documents/posts.sqlite")

# Read the CSV into R
data <- read.csv("C:/Users/HP/Documents/posts.csv",
    stringsAsFactors = FALSE)

# Copy the data frame into the database and list the columns
dbWriteTable(posts_db, "posts", data, overwrite = TRUE)

dbListFields(posts_db, "posts")

dbGetQuery(posts_db,"WITH 
cte1 AS (SELECT screen_name, comments_count*1.0/likes_count AS clr, 
strftime('%m', date) AS month FROM posts WHERE likes_count>0), 
cte2 AS (SELECT (MAX(clr) - MIN(clr)) AS normaliser_based_on_even_months 
FROM cte1 
WHERE month % 2 = 0
GROUP BY screen_name),
cte3 AS (SELECT screen_name, clr, normaliser_based_on_even_months,
clr/normaliser_based_on_even_months AS normalized_clr FROM cte1, cte2 
WHERE normaliser_based_on_even_months>0)
SELECT screen_name, normalized_clr FROM cte3 
WHERE normalized_clr>0 
ORDER BY normalized_clr")

Here's the output:

               screen_name normalized_clr
1             repmarkpocan   0.0002546821
2             repmarkpocan   0.0002690018
3  CongressmanRalphAbraham   0.0002756995
4  CongressmanRalphAbraham   0.0002912010
5             derek.kilmer   0.0003549631
6           RepJimMcGovern   0.0003664136
7      CongresswomanNorton   0.0003687929
8             derek.kilmer   0.0003749212
9           RepJimMcGovern   0.0003870155
10     CongresswomanNorton   0.0003895286

Sample Data (originally posted as an image; the full posts.csv is in the data folder of the repo)

For reference here is the R code that produces the result I want the SQLite Query to produce:

library(dplyr)
library(lubridate)

posts <- read.csv("C:/Users/HP/Documents/posts.csv")

#Remove rows with zero likes
posts <- posts %>% filter(likes_count > 0)

#create 'clr' which is the comment to like ratio
posts <- posts %>% mutate(clr = comments_count / likes_count) 

#Compute the maximum minus the minimum `clr` value of posts by each `screen_name`, however, only taking into account the posts made in even months, i.e. posts made in February, April, June, August, October, December, when computing `max(clr) - min(clr)`. Code from https://stackoverflow.com/questions/30606360/subtract-value-from-previous-row-by-group

# Replace the date column with just its month number (1-12), used for the even-month check below
posts$date <- ymd(posts$date)
posts$date <- month(posts$date)


posts_normaliser <- posts %>% group_by(screen_name) %>% mutate(normaliser_based_on_even_months = case_when(date%%2==0 ~ (max(clr) - min(clr))))


#Set all `normaliser_based_on_even_months` that have a value of zero to NA or delete them
posts_normaliser <- posts_normaliser %>% filter(normaliser_based_on_even_months > 0)

#Afterwards create a column `normalised_clr` which stores the `clr` of all posts from the original data frame (other than those with zero likes which were deleted in the first step) divided by the `normaliser_based_on_even_months` of the associated screen name. 

merged_df <- merge(posts, posts_normaliser)

merged_df <- merged_df %>% group_by(screen_name)%>% mutate(normalised_clr = clr / normaliser_based_on_even_months)

#Keep only those rows with `normalised_clr` > 0

merged_df <- merged_df %>% filter(normalised_clr > 0)

#Arrange the data frame according to `normalised_clr` in ascending order

merged_df <- merged_df %>% arrange(normalised_clr)

#Print out only `screen_name` and `normalised_clr` for the first 10 rows, i.e. the posts with the 10 lowest `normalised_clr`

merged_df[1:10, c("screen_name", "normalised_clr")]

Here's the Output from R:

> merged_df[1:10, c("screen_name", "normalised_clr")]
# A tibble: 10 × 2
# Groups:   screen_name [5]
   screen_name                   normalised_clr
   <chr>                                  <dbl>
 1 CongresswomanSheilaJacksonLee        0.00214
 2 CongresswomanSheilaJacksonLee        0.00218
 3 CongresswomanSheilaJacksonLee        0.00277
 4 RepMullin                            0.00342
 5 SenDuckworth                         0.00342
 6 CongresswomanSheilaJacksonLee        0.00357
 7 replahood                            0.00477
 8 SenDuckworth                         0.00488
 9 SenDuckworth                         0.00505
10 RepSmucker                           0.00516

I keep going through the statements but I can't figure out why the outputs are different.
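As an aside, one way to cross-check the translation: dbplyr can print the SQL it would generate for a dplyr pipeline run against the database table. A minimal sketch, assuming the posts_db connection and posts table created above (note that dbplyr translates / literally, and SQLite divides two integer columns as integers, which is why the SQL query above multiplies by 1.0):

library(dbplyr)

tbl(posts_db, "posts") %>%
  filter(likes_count > 0) %>%
  mutate(clr = comments_count * 1.0 / likes_count) %>%
  show_query()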

Jay
  • It would help immensely if you could provide sample data and expected output for that data, see https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info for good discussions about possible methods using `dput`, `data.frame`, and `read.table(text=..)`. Thanks! – r2evans Jan 18 '23 at 22:05
  • How do you expect people to use your sample data when it is in the form of an image? The output is short enough to post as text in the question. The R code is not relevant to the question of the correct SQL query. – Mark Benningfield Jan 19 '23 at 12:48
  • Please do not post (only) an image of code/data/errors: it breaks screen-readers and it cannot be copied or searched (ref: https://meta.stackoverflow.com/a/285557 and https://xkcd.com/2116/). Please include the code, console output, or data (e.g., `data.frame(...)` or the output from `dput(head(x))`) directly. – r2evans Jan 19 '23 at 13:04
  • Jay, you're asking us to either (1) transcribe from an image data that you have readily available in non-image form; or (2) know as much as needed about the data just from looking at an image. You should want to make it as easy for us as possible to play with your data in its correct form, and while your first image is a tibble (and includes class information) and is therefore far better than the second (a data.frame, with no idea what each column's class _really_ is, so we have to guess, which often gets it wrong), there's still the point of having to transcribe it. I choose to not spend the time. – r2evans Jan 19 '23 at 13:07
  • Alright, I'll post a snippet of the data and the output soon. Thanks – Jay Jan 19 '23 at 14:48
  • I'd like to apologize, I didn't mean to make your life harder. I've added the full dataset on GitHub and I'm updating the output right now. – Jay Jan 19 '23 at 15:01
  • Hi, thanks for your suggestion. I was able to do the query, but it seems that the calculation is wrong. I think it's probably because when calculating clr, R gives me float values as it should, but SQL seems to only give int values, probably because likes_count and comments_count used to calculate clr are both int. How do I explicitly tell it what data type to use? – Jay Jan 19 '23 at 16:54
  • I was told not to create another question for this; please take a look at the edited question. Thanks – Jay Jan 19 '23 at 18:16
  • @r2evans Can you please take a look? – Jay Jan 19 '23 at 18:22
  • For your apology, you aren't making my life harder at all; my involvement on SO is completely voluntary and when-able. I have brief snapshots to look at things, and if a question takes more than "so-many" seconds, I'm out. I think I'm not alone in that: quick answers come to questions that are concise, well-supported, and reduce the cost of entry. Transcribing is one such cost of entry that is easy to remove. Another cost of entry for me is the complexity of the constraints, and this one is not simple: one (of eight!) constraints is a large paragraph that I don't fully understand. – r2evans Jan 19 '23 at 18:30
  • @r2evans Let me go back and make the description clearer; this is what I got from the professor – Jay Jan 19 '23 at 18:33
  • @r2evans I have edited what I think was problematic, please take a look. – Jay Jan 19 '23 at 18:40
  • Hell, I'm very new to this. I did not know that Stack Overflow does not support LaTeX – Jay Jan 19 '23 at 18:44

1 Answer


If you look at the output from your SQL query, it returns 4,543,128 rows from the original 10,000, indicating that your joins are exploding. You need to join on something; since the tables you're joining hold either (a) the original rows or (b) one summarized row per screen_name, I think we can join on that column.
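To see why, note that FROM cte1, cte2 in your cte3 has no join condition, so it is a cross join: every row of cte1 is paired with every row of cte2. A minimal sketch with hypothetical toy tables, just to show the cardinality:

WITH a(id) AS (VALUES (1), (2), (3)),
     b(nm) AS (VALUES ('x'), ('y'))
SELECT COUNT(*) FROM a, b;  -- 3 x 2 = 6 rows, not 3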

I modified your query to do three things:

  • add screen_name to cte2
  • update the join inside cte3
  • add limit 10 since you said "all into a single query", and "lowest 10" is part of the constraints.
WITH
cte1 AS (
  SELECT screen_name, comments_count*1.0/likes_count AS clr,
    strftime('%m', date) AS month
  FROM posts WHERE likes_count > 0
),
cte2 AS (
  SELECT screen_name, (MAX(clr) - MIN(clr)) AS normaliser_based_on_even_months
  FROM cte1
  WHERE month % 2 = 0
  GROUP BY screen_name
),
cte3 AS (
  SELECT cte1.screen_name, clr, normaliser_based_on_even_months,
    clr/normaliser_based_on_even_months AS normalized_clr
  FROM cte1
    LEFT JOIN cte2 on cte1.screen_name = cte2.screen_name
  WHERE normaliser_based_on_even_months > 0
)
SELECT screen_name, normalized_clr
FROM cte3
  WHERE normalized_clr > 0
ORDER BY normalized_clr
LIMIT 10

This produces:

      screen_name normalized_clr
1    SenDuckworth    0.002318900
2       RepMullin    0.003415301
3    SenDuckworth    0.003425708
4  repmikecoffman    0.003861004
5    SenDuckworth    0.004173566
6    SenDuckworth    0.004880627
7    SenDuckworth    0.005035819
8    SenDuckworth    0.005051889
9    SenDuckworth    0.005112120
10        RepBost    0.005600815
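
Two side notes. First, since the final WHERE discards rows whose normaliser is NULL or non-positive, the LEFT JOIN effectively behaves like an inner join here. Second, the same result should be reachable in one pass with window functions (SQLite 3.25+); here is a sketch of that variant, untested against the full dataset:

WITH cte1 AS (
  SELECT screen_name, comments_count*1.0/likes_count AS clr,
    CAST(strftime('%m', date) AS INTEGER) AS month
  FROM posts WHERE likes_count > 0
),
cte2 AS (
  SELECT screen_name, clr,
    MAX(CASE WHEN month % 2 = 0 THEN clr END) OVER (PARTITION BY screen_name)
    - MIN(CASE WHEN month % 2 = 0 THEN clr END) OVER (PARTITION BY screen_name)
      AS normaliser_based_on_even_months
  FROM cte1
)
SELECT screen_name, clr/normaliser_based_on_even_months AS normalized_clr
FROM cte2
WHERE normaliser_based_on_even_months > 0
  AND clr/normaliser_based_on_even_months > 0
ORDER BY normalized_clr
LIMIT 10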
r2evans