I'm new to R. Can anyone help me?

I import a CSV extract of Stack Overflow data with:

s <- read_csv("https://www.ics.uci.edu/~duboisc/stackoverflow/answers.csv")

Then I split the values in the 'tags' column into separate rows:

ss1 <- separate_rows(s, tags)

Then I apply pivot_wider() to the 'tags' column:

ss2 <- pivot_wider(ss1, names_from = tags, values_from = qs)

This produces the following error and warnings:

Error: Internal error in compact_rep(): Negative n in compact_rep().
Run rlang::last_error() to see where the error occurred.
In addition: Warning messages:
1: Values are not uniquely identified; output will contain list-cols.
  • Use values_fn = list to suppress this warning.
  • Use values_fn = length to identify where the duplicates arise
  • Use values_fn = {summary_fun} to summarise duplicates
2: In nrow * ncol : NAs produced by integer overflow

I have searched for the keywords in these messages but cannot work out what the errors mean overall. Can anyone help? Thanks.
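The first warning's hint about values_fn = length can be tried on a small table. This is only a sketch on made-up data: the column names X1, tags and qs mirror the ones mentioned in this thread, but the values are invented, and it assumes the tidyr package is installed.

```r
library(tidyr)

# Toy data: column names mirror the question, values are made up.
toy <- data.frame(
  X1   = c(1, 1, 2),
  tags = c("php", "php", "lisp"),
  qs   = c(10, 20, 30)
)

# Count how many qs values land in each (X1, tag) cell.
# A count above 1 marks a cell that would otherwise become a list-column.
counts <- pivot_wider(toy, names_from = tags, values_from = qs, values_fn = length)
print(counts)
```

In this toy table the cell for X1 = 1 and the tag php holds two values, which is the kind of duplicate the warning describes.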

Phil
sspoldtwo
  • Thank you for your help. Since I have not managed to do it correctly yet, I do not understand why there are "duplicate rows". Don't the values in the "X1" column already distinguish the duplicate rows? I tried your suggestion: ```ss3 <- ss1 %>% mutate(id = row_number())``` Then I ran pivot_wider again: ```ss4 <- pivot_wider(ss3, names_from = tags, values_from = qs)``` Now I get new errors: ```Error: Internal error in `compact_rep()`: Negative `n` in `compact_rep()`. Run `rlang::last_error()` to see where the error occurred.``` (see next comment) – sspoldtwo Mar 26 '21 at 09:40
  • ```In addition: Warning message: In nrow * ncol : NAs produced by integer overflow``` Would you please give me further directions? Thanks. – sspoldtwo Mar 26 '21 at 09:44
  • I posted my answer; please check it and let me know whether it is what you were looking for. – Anoushiravan R Mar 26 '21 at 10:08
  • Thank you for your suggestion. It really helped me with this issue. – sspoldtwo Mar 26 '21 at 10:39

3 Answers

@Anoushiravan R:

Thank you very much for your kind suggestion again.

With your suggestion, I get this error:

> ss1 <- s %>%
+     separate_rows(tags) %>% 
+     select(qs, tags) %>%
+     group_by(tags) %>%
+     mutate(id = row_number()) %>%
+     ungroup() %>%
+     mutate(tags = if_else(tags == "", "unknown", tags))
> ss2 <- ss1 %>% pivot_wider(names_from = tags, values_from = qs, names_repair = "minimal")

Error: cannot allocate vector of size 5.4 Gb

Before this, I always got a different message: In nrow * ncol : NAs produced by integer overflow.

I googled In nrow * ncol : NAs produced by integer overflow and found that it may be related to the console pane. See https://github.com/wrathematics/float/issues/17

After removing all the objects/datasets from the global environment and restarting RStudio, I get the same result as you.

Because I want to include ALL columns in the result, I removed "select(qs, tags) %>%" from your code, which gives the following code and errors:

> ss1 <- s %>%
+     separate_rows(tags) %>% 
+     
+     group_by(tags) %>%
+     mutate(id = row_number()) %>%
+     ungroup() %>%
+     mutate(tags = if_else(tags == "", "unknown", tags))
> View(ss1)
> ss2 <- ss1 %>% pivot_wider(names_from = tags, values_from = qs, names_repair = "minimal")

Error: Internal error in `compact_rep()`: Negative `n` in `compact_rep()`.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
In nrow * ncol : NAs produced by integer overflow

The In nrow * ncol : NAs produced by integer overflow warning appears again.

I googled the main error, Error: Internal error in `compact_rep()`: Negative `n` in `compact_rep()`, and could not find a good answer.

I also tried different combinations with "group_by" but could not get a satisfactory result. Anyway, thank you very much for your help.

sspoldtwo
  • I also received the same error, and I think it has something to do with R's memory limit; when I restarted my R session I got the result. But I have no idea how to fix this. In the end I am at least happy that we got, to some extent, the result you were interested in. My pleasure, and you're welcome. – Anoushiravan R Mar 26 '21 at 15:36
  • The hyperlink in the previous comment and Google suggest that R loads the ENTIRE dataset into RAM, so the size limit depends on YOUR physical RAM. Thus, after I cleaned up the unwanted datasets in RStudio, we could get the result. – sspoldtwo Mar 27 '21 at 00:43
  • The number ```2^31 - 1``` keeps coming up in my search results. After reading https://stackoverflow.com/a/48676389/15484790 and https://stackoverflow.com/a/5234293/15484790, I think this may explain the main error. In my dataset, the total number of rows is 1,007,855, and the number of columns after pivot_wider() will be 10,534. So rows x cols = 10,616,744,570, which is much more than ```2^31 - 1 = 2,147,483,647```. – sspoldtwo Mar 27 '21 at 02:25
  • Yes, I understand. So in the end, was the output what you were looking for? I guess if you do the same data manipulation in SQL you won't get the error. – Anoushiravan R Mar 27 '21 at 09:45
  • Yes, this is what I intended to do. At least I have learnt the limitations now. – sspoldtwo Mar 27 '21 at 10:43
  • Glad to hear that then. It really took a couple of minutes for my system to load that data set. – Anoushiravan R Mar 27 '21 at 13:02
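
The size estimate in the comments above can be checked with a few lines of base R. The row and column counts are the ones reported in this thread. The 8 bytes per cell is an assumption (every cell stored as a double), so the memory figure is only a rough estimate. It happens to be close to the 5.4 Gb in the allocation error, but that link is an inference, not something the thread confirms.

```r
# Shapes reported in this thread.
rows_all <- 1007855   # rows after separate_rows(), all columns kept
cols_all <- 10534     # columns expected after pivot_wider()
rows_sel <- 68384     # rows of the tibble produced with select(qs, tags)
cols_sel <- 10522     # columns of that tibble

# These literals are doubles, so the products do not overflow
# the way R's 32-bit integers would.
cells_all <- rows_all * cols_all
cells_sel <- rows_sel * cols_sel

cells_all > .Machine$integer.max   # TRUE: more cells than 2^31 - 1
cells_sel > .Machine$integer.max   # FALSE: fits in an integer count

# Rough size of the smaller result, assuming 8 bytes per cell.
gib_sel <- cells_sel * 8 / 1024^3
round(gib_sel, 1)
```

So the full-column version exceeds the integer limit before memory is even considered, while the two-column version fits the limit but still needs several gigabytes of RAM.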
OK, I edited my solution; I hope this is what you were looking for. This time I used separate_rows, as you suggested, to separate the values stacked in each row of the tags column. Run the following code and let me know if there is anything else you need.

s %>%
  separate_rows(tags) %>% 
  select(qs, tags) %>%
  group_by(tags) %>%
  mutate(id = row_number()) %>%
  ungroup() %>%
  mutate(tags = if_else(tags == "", "unknown", tags)) %>%
  pivot_wider(names_from = tags, values_from = qs, names_repair = "minimal")


# A tibble: 68,384 x 10,522
      id   php error    gd image processing  lisp scheme subjective clojure cocoa touch
   <int> <dbl> <dbl> <dbl> <dbl>      <dbl> <dbl>  <dbl>      <dbl>   <dbl> <dbl> <dbl>
 1     1     0     0     0     0          0    10     10         10      10     0     0
 2     2     0     0     0     0          0    10     10         10      10     0     0
 3     3     1     0     1     1          1    10     10         10      10     0     0
 4     4     1     2     0     1          1    10     10         10      10     1     1
 5     5     1     2     0     1          1    10     10         10      10     0     0
 6     6     2     2     1     1          1    10     10         10      10     1     1
 7     7     2     2     1     0          1    10     10         10      10     1     1
 8     8     2     2     0     0          1    10     10         10      10     3     3
 9     9     0     2     0     0          1    10     10         10      10     3     3
10    10     0     2     0     0          1    10     10         10      10     3     3
# ... with 68,374 more rows, and 10,510 more variables

Since the data here is a bit heavy, I suggest you first run the code up to pivot_wider and then run the pivot_wider line on its own. I don't know why, but only this way do I get the desired output; otherwise I receive an error.
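
To see what the group_by() and row_number() steps contribute, here is a small sketch on made-up data (the column names qs and tags follow the question, the values are invented, and dplyr and tidyr are assumed to be installed). The id restarts at 1 inside each tag, so id on its own repeats, but each combination of id and tag occurs only once, which is what pivot_wider() needs to avoid list-columns.

```r
library(dplyr)
library(tidyr)

# Toy data: column names mirror the question, values are made up.
toy <- tibble(
  qs   = c(10, 20, 30, 40),
  tags = c("php", "php", "lisp", "php")
)

with_id <- toy %>%
  group_by(tags) %>%
  mutate(id = row_number()) %>%   # 1, 2, 3 within "php"; 1 within "lisp"
  ungroup()

# 0 means no (id, tags) pair is duplicated.
anyDuplicated(with_id[c("id", "tags")])

wide <- with_id %>%
  pivot_wider(names_from = tags, values_from = qs)
print(wide)
```

The result has one row per id and one column per tag, with NA where a tag has fewer entries than the longest one.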

Anoushiravan R
  • Thank you for your help. Your suggestion is what I intend to do, except that I want to create new columns from the values in "tags". That's why I use ```ss1 <- separate_rows(ss, tags)``` first. Since I am just experimenting with the pivot function, I do not care about the NAs at the moment. Thanks for the reminder. I do not understand your use of "group_by": the "id" now numbers the individual members of each group starting from 1, so it is not UNIQUE. Am I correct? But I am still thinking about how it could produce this result. Inspired by your final comment, I tried the following but it still fails, – sspoldtwo Mar 26 '21 at 10:32
  • `ss1 <- ss %>% separate_rows(tags)` `ss2 <- ss1 %>% mutate(id=c(1:nrow(ss1)))` `ss3 <- ss2 %>% pivot_wider(names_from = tags, values_from = qs)` Error: Internal error in `compact_rep()`: Negative `n` in `compact_rep()`. Run `rlang::last_error()` to see where the error occurred. In addition: Warning message: In nrow * ncol : NAs produced by integer overflow – sspoldtwo Mar 26 '21 at 10:34
  • Sorry, one more question about your last comment, "This error usually happens when the combination of column values are not unique so we need to create unique ids for each row before you apply pivot_wider." If I CREATE a new RAW data frame, is it correct that I cannot have duplicate rows, i.e., rows with exactly the same values in all columns? – sspoldtwo Mar 26 '21 at 10:38
  • Following your comment, so far I could only get this to work: `ss1 <- s %>% group_by(tags) %>% mutate(id = row_number()) %>% separate_rows(tags)`. After that, errors are shown again if I try to pivot it. – sspoldtwo Mar 26 '21 at 10:53
  • No it's ok let me see what I can do. – Anoushiravan R Mar 26 '21 at 10:54
This is a bug in R, or a limitation; whatever we call it, there is no direct solution for it. This is the essence of the error:

a <- 1000000L
b <- 2000000L
a * b

It yields NA with a warning: In a * b : NAs produced by integer overflow
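
A minimal base R sketch of the same overflow, together with the usual way around it in your own code, which is to convert to double before multiplying. Note that this only helps where you control the multiplication; in the thread the overflowing product appears to be computed inside pivot_wider(), where it cannot be changed from outside.

```r
a <- 1000000L
b <- 2000000L

# Integer arithmetic: the product exceeds .Machine$integer.max,
# so R returns NA (the warning is suppressed here to keep the output clean).
int_product <- suppressWarnings(a * b)
is.na(int_product)   # TRUE

# Converting to double first avoids the overflow.
dbl_product <- as.numeric(a) * as.numeric(b)
dbl_product          # 2e+12
```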

I have circumvented the issue with a different approach. It is not as neat or direct as using separate_rows() and then pivot_wider(), but it works!

This is the idea:

  1. Find all the unique (hash)tags and save them in a vector.
  2. Loop through the vector and str_detect() each element in the original text.
  3. You will have a logical vector for each tag as the result of step 2; bind_cols() them.

Actually, steps 2 and 3 are implemented in a single loop.

For step 1, you can use separate_rows() and then distinct() on the tags column, then pull it out of the tbl.
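
The steps above can be sketched on toy data as follows. This is not the answerer's actual code: it swaps in base R (strsplit() and %in% in place of separate_rows(), str_detect() and bind_cols()) so that whole tags are matched, not substrings. The data, the name s_toy, and the single-space separator are assumptions made for illustration.

```r
# Toy data: column names mirror the question, values are made up.
# Tags are assumed to be separated by a single space.
s_toy <- data.frame(
  qs   = c(10, 20, 30),
  tags = c("php error", "lisp scheme", "php lisp"),
  stringsAsFactors = FALSE
)

# Step 1: split each row's tags and collect the unique ones.
tag_list <- strsplit(s_toy$tags, " ", fixed = TRUE)
all_tags <- sort(unique(unlist(tag_list)))

# Steps 2 and 3: build one logical column per tag,
# TRUE where the row carries that tag.
has_tag <- vapply(
  all_tags,
  function(tag) vapply(tag_list, function(x) tag %in% x, logical(1)),
  logical(nrow(s_toy))
)

# Bind the indicator columns to the original rows.
wide <- cbind(s_toy, has_tag)
print(wide)
```

Because the result keeps one row per original row, the row count does not grow the way it does after separate_rows(). The indicator matrix still has one column per unique tag, so memory use can remain substantial on the full dataset.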

Shaahin