1

I'm sorry if this is a duplicate question, but I have looked around at similar problems and haven't been able to find a real solution. Anyway, here goes:


I've read a .csv file into a table. There I'm dealing with 3 columns: "ID"(author's ID), "num_pub"(number of articles published), and "year"(spans from 1930 to 2017).

I would like to get a final table where I would have "num_pub" for each "year", for every "ID". So rows would be "ID"s, columns would be "year"s, and underneath each year there would be the corresponding "num_pub" or 0 value if the author hasn't published anything.


I have tried creating two new tables and merging them in a few different ways described here but to no avail.


So first I read my file into a table: tab<-read.table("mytable.csv",sep=",",head=T,colClasses=c("character","numeric","factor"))

head(tab,10) ID num_pub year 1 00002 1 1977 2 00002 2 1978 3 00002 1 1983 4 00002 4 1984 5 00002 3 1990 6 00002 1 1994 7 00002 2 1996 8 00004 3 1957 9 00004 1 1958 10 00004 1 1959

With that, I was then able to create a table where for each "ID", there was every single "year", and if the author published in that year, the value was 1, otherwise it was 0: a<-table(tab[,1], tab[,3])

Calling head(a,1) returns the following table: pic


I would like to know how to achieve the desired result I described above. Namely, having a table where rows would be populated with "ID"s, columns would be populated with "year"s (from 1930 to 2017), and underneath each year, there would be an actual "num_pub" value or a 0 value. The structure of the table would be just like the one shown in the pic

Thank you for your time and help. I'm very new to R, and kind of stuck in the mud with this.

Edit: the reshape approach as explained here does not solve my problem. I need zeros in place of "NA"s, and I want my year to start with 1930 instead of the first year that the author has published.

Blitva
  • 121
  • 1
  • 11
  • Possible duplicate of [How to reshape data from long to wide format?](https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format) – jay.sf Feb 02 '19 at 15:34
  • I tried this approach, but this way I end up with "NA" values instead of zeros. Also, the years with "num_pub" values end up bundled at the beginning, instead of the table starting with year 1930. Thank you all the same though, I didn't know of this approach. – Blitva Feb 02 '19 at 15:49
  • Simply replace NAs with 0s in a second step: `dat[is.na(dat)] <- 0`. Your second requirement can be solved by building a fake dataset with all author-year combos and then merging onto it. something like `squareData <- expand.grid(authors=authorVec, years=1930:2018)`. – lmo Feb 02 '19 at 18:00

3 Answers3

3

using reshape2 & dcast one can change to a wide format and then pipe through to replace NAs with 0s.

library(reshape2)
library(dplyr)

dcast(tab, ID~year, value.var = "num_pub") %>% 
  replace(is.na(.), 0)

     ID 1957 1958 1959 1977 1978 1983 1984 1990 1994 1996
1 00002    0    0    0    1    2    1    4    3    1    2
2 00004    3    1    1    0    0    0    0    0    0    0
nycrefugee
  • 1,629
  • 1
  • 10
  • 23
  • 1
    Thank you very much. This is precisely the output I was looking for. – Blitva Feb 05 '19 at 16:52
  • 1
    I happen to think this is an excellent answer because the code is essentially self-documenting, but there are some SO purists who might complain that there is insuficient explication. – IRTFM Feb 05 '19 at 17:56
2

You can use complete to fill in the zeros for non available data, and then spread to turn your column of years into multiple columns (both from the tidyr package):

library(tidyr)

df_complete <-
  complete(df, ID, year, fill = list(num_pub = 0))

spread(df_complete, key = year, value = num_pub)

# A tibble: 2 x 11
  ID    `1957` `1958` `1959` `1977` `1978` `1983` `1984` `1990` `1994` `1996`
  <fct>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 00002      0      0      0      1      2      1      4      3      1      2
2 00004      3      1      1      0      0      0      0      0      0      0

Data:

df <-
  data.frame(ID = c("00002", "00002", "00002", "00002", "00002", "00002", "00002", "00004", "00004", "00004"),
             num_pub = c(1, 2, 1, 4, 3, 1, 2, 3, 1, 1),
             year = c(1977, 1978, 1983, 1984, 1990, 1994, 1996, 1957, 1958, 1959))
g_t_m
  • 704
  • 4
  • 9
1

In base R this might be handled with a merge operation followed by some coercion to 0/1 by way of negating is.na and using as.numeric. (Admittedly the complete function appears easier.

temp <-  merge(expand.grid(ID=sprintf("%05d", 2:4),year=1930:2018), tab, all=T) 
str(temp)
#--------
'data.frame':   267 obs. of  3 variables:
 $ ID     : Factor w/ 3 levels "00002","00003",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year   : int  1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 ...
 $ num_pub: num  NA NA NA NA NA NA NA NA NA NA ...

 temp$any_pub <- as.numeric(!is.na(temp$num_pub))

 head(temp)
     ID year num_pub any_pub
1 00002 1930      NA       0
2 00002 1931      NA       0
3 00002 1932      NA       0
4 00002 1933      NA       0
5 00002 1934      NA       0
6 00002 1935      NA       0

tapply(temp$any_pub, temp$ID,sum)
#
00002 00003 00004 
    7     0     3 
IRTFM
  • 258,963
  • 21
  • 364
  • 487