2

I get the main idea behind this error: my data is too big (512218 records with 3 variables) and I'm trying to convert the data frame to tabular format so I can get an adjacency matrix. Right now I'm using xtabs and getting this error:

n <- xtabs(USER_LINK ~ screenName + screen_name_mention, df)

I tried using sapply(df, table) (as mentioned in a related question) but it didn't work. What I want to know is: is there an alternative way to convert the data frame to tabular format without getting this error?

head of data

 screenName    screen_name_mention   USER_LINK
1  g_fandos       ecolandlab            1
2 andrewmbrass    PLOSBiology           1
3 andrewmbrass    PLOSBiology           1
4  welloldstem     dbcurren             1
5 PaulJDavison     BehavEcol            1
 6  cbjones1943     BiolJLinnSoc         1

str(df)

'data.frame':   512218 obs. of  3 variables:
 $ screenName         : Factor w/ 150233 levels "","#$%","#cuttingeeg",..: 50920 8866 8866 145600 106833 23847 23847 98575 98575 61282 ...
 $ screen_name_mention: Factor w/ 150233 levels "","#$%","#cuttingeeg",..: 41276 110025 110025 33531 15579 17454 61209 112371 38473 110091 ...
 $ USER_LINK          : int  1 1 1 1 1 1 1 1 1 1 ...

Example of what I need:

 User_name  M_User  Total
   user 1  user 2     7
   user 1  user 3    19
   user 1  user 7     5
   user 3  user 2     1
   user 2  user 7     1

End result:

User_name user 1 user 2 user 3 user 7
   user 1      0      7     19      5
   user 2      0      0      0      1
   user 3      0      1      0      0
   user 7      0      0      0      0

My code works fine for a small dataset like this (it even creates a 5000x5000 matrix) but not for the large dataset.

melissa
  • Please share `head` of your data. – MKR Dec 26 '17 at 20:38
  • added in question now @MKR – melissa Dec 26 '17 at 20:46
  • Thanks @melissa. Could you please share `str` of your data as well? – MKR Dec 26 '17 at 20:54
  • sure added in question @MKR – melissa Dec 26 '17 at 21:18
  • Do you want to obtain an array from your data? – Onyambu Dec 26 '17 at 21:25
  • @Onyambu i have updated the question with an example of what I need – melissa Dec 26 '17 at 21:26
  • your adjacency matrix will be large (150233*150233), so try setting `sparse=TRUE` in `xtabs` to see if it helps. Or create a graph using `igraph` and output a sparse adj matrix?? – user20650 Dec 26 '17 at 21:58
  • can you give a try to `reshape2::dcast(df,User_name~M_user,fill = 0)`? – Onyambu Dec 26 '17 at 22:42
  • I can't convert a sparse matrix to a normal matrix, it gives an error @user20650 – melissa Dec 28 '17 at 12:59
  • @Onyambu It doesn't detect USER_LINK for some odd reason, there are no spelling mistakes :/ it gives the error `Error in match(x, table, nomatch = 0L) : object 'USER_LINK' not found` – melissa Dec 28 '17 at 13:02
  • @melissa; are you trying to convert the sparse matrix to a dense matrix using all the data? If so, why? The reason to use a sparse matrix is so that it uses much, much less memory. A 150233*150233 matrix is very unlikely to fit in memory (150233*150233*8/2^30 = 168GB), but your adjacency matrix is 99.99% sparse (100 * (1 - 512218 / (150233*150233))). It wastes space to explicitly store the zeros, so the sparse matrix representation only stores the non-zero elements - hence it fits in memory – user20650 Dec 28 '17 at 13:26
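
For reference, a minimal sketch of the sparse `xtabs` route suggested in the comments above; it assumes the Matrix package is installed, since `sparse = TRUE` makes `xtabs` return a sparse matrix from that package:

library(Matrix)  # xtabs(..., sparse = TRUE) builds on the Matrix package

# Instead of a dense 150233 x 150233 table (~168 GB), store only the
# non-zero cells: at most ~512K entries for this data
n_sparse <- xtabs(USER_LINK ~ screenName + screen_name_mention, df, sparse = TRUE)

dim(n_sparse)  # still 150233 x 150233, but stored sparsely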

1 Answer


One option is to use the `spread` function from the tidyr package.

If you apply the `spread` function to the example data set you provided:

df <- data.frame(User_name = c("user 1", "user 1", "user 1", "user 3", "user 2"),
                 M_user = c("user 2", "user 3", "user 7", "user 2", "user 7"),
                 Total = c(7, 19, 5, 1, 1)
                 )

> df
  User_name M_user Total
1    user 1 user 2     7
2    user 1 user 3    19
3    user 1 user 7     5
4    user 3 user 2     1
5    user 2 user 7     1
#spread will convert the data as below:
> spread(df, M_user, Total)
  User_name user 2 user 3 user 7
1    user 1      7     19      5
2    user 2     NA     NA      1
3    user 3      1     NA     NA
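
If you want zeros instead of NA, as in the expected output in the question, `spread` also takes a `fill` argument:

library(tidyr)

# fill = 0 puts 0 in the combinations that have no row in df, instead of NA
> spread(df, M_user, Total, fill = 0)
  User_name user 2 user 3 user 7
1    user 1      7     19      5
2    user 2      0      0      1
3    user 3      1      0      0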

I normally prefer to break big data horizontally during analysis, to better understand the nature of the data. For example, the OP's data has more than 500K rows, which is blocking any analysis. I would prefer to break the data into, say, 5 parts as:

df[1:100000,]
df[100001:200000,]
df[200001:300000,]
df[300001:400000,]
df[400001:512218,]

Analysing those 5 subsets first (one can break them into even smaller sets) may give you a better understanding of the data and of the rules to apply. It is also possible to analyse the smaller sets first and then find an easy way to combine the results.
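
A minimal sketch of that chunked idea combined with the sparse `xtabs` route from the comments; it assumes both columns are factors sharing the same 150233 levels (as in the `str` output), so the per-chunk matrices have identical dimensions and can simply be added:

# split the row indices into 5 roughly equal chunks
idx_chunks <- split(seq_len(nrow(df)), cut(seq_len(nrow(df)), 5, labels = FALSE))

# cross-tabulate each chunk; unused factor levels are kept by default, so every
# chunk yields a sparse matrix with the same 150233 x 150233 dimensions
mats <- lapply(idx_chunks, function(i)
  xtabs(USER_LINK ~ screenName + screen_name_mention, df[i, ], sparse = TRUE))

# element-wise sum of the chunk matrices gives the adjacency matrix for the full data
adj <- Reduce(`+`, mats)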

MKR