
I have survey data with these variables:

df <- data.frame(Sex = c("Male","Female","Male","Female","Male"),
                 Age = c(19,20,34,56,45),
                 ExpansionFactor = c(123456789,31256789,127896543,251436978,536294817))

I want to create a report, but first I need to expand the survey data without crashing my PC.

My desired data set:

Sex       Age
Male      19
.         .
.         .
.         .
Female    20
.         .
.         .
.         .
Male      34
.         .
.         .
.         . 
Female    56
.         .
.         .
.         .
Male      45
.         .
.         .
.         .
Male      45

dim(df)
[1] 1070341916 2 

Any suggestions?

Thank you very much for your help.

adircinho
  • Do you definitely need 1 billion raw records? What are you intending to do with the data once it is expanded? It is possible to do modelling on summary data without expanding, for instance. – thelatemail Feb 07 '20 at 00:49
  • Thank you very much for your answer. This is just an example, but I want to create a report. With my real data, I need to expand to the population, which is around 35 million people. I'd appreciate any suggestions. – adircinho Feb 07 '20 at 00:53
  • You can try `tidyr::uncount(df, ExpansionFactor)` or the other methods mentioned in https://stackoverflow.com/questions/2894775/repeat-each-row-of-data-frame-the-number-of-times-specified-in-a-column – Ronak Shah Feb 07 '20 at 01:13
  • I also wonder if you must expand the data (as opposed to parsing relevant information from files line by line) (But with N=3.5E7 and 32GB+ of RAM it should work). Yet, your report will summarize the data as well. If you feel you need that whole object in R, perhaps you could keep it in a more compact form, e.g. as Rle (see https://bioconductor.org/packages/release/bioc/html/S4Vectors.html) and giving different runValues to any sub-category that your data splits into. Depends on where the additional info will come from. Ex: `S4Vectors::Rle(paste(df$Sex, df$Age, sep=";"), df$ExpansionFactor)` – user12728748 Feb 07 '20 at 02:16
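
For reference, the row-repetition idea from the comments can be sketched on a scaled-down copy of the data. This is a minimal base R sketch, not the asker's code: the `small` data frame and its tiny weights are made up for illustration, and the size estimate assumes 8 bytes per numeric value and 4 bytes per factor code.

```r
# Toy version of the survey data; the weights are deliberately tiny (hypothetical values)
small <- data.frame(
  Sex = c("Male", "Female", "Male"),
  Age = c(19, 20, 34),
  ExpansionFactor = c(3, 2, 4)
)

# Repeat each row index ExpansionFactor times, then drop the weight column
idx <- rep(seq_len(nrow(small)), times = small$ExpansionFactor)
expanded <- small[idx, c("Sex", "Age")]
rownames(expanded) <- NULL

nrow(expanded)  # equals sum(small$ExpansionFactor)

# Rough size of the full expansion described in the question
n_rows <- sum(c(123456789, 31256789, 127896543, 251436978, 536294817))
bytes_per_row <- 8 + 4  # assumed: Age stored as double, Sex as a factor code
n_rows * bytes_per_row / 1024^3  # approximate GiB, before any copies R makes
```

With the question's weights the last line comes out at roughly 12 GiB for the two columns alone, which is why the comments ask whether the expansion is really needed.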

1 Answer


I don't really understand why you would need the data in that form. You can create a report perfectly well using weighted summaries of the data, as follows.

data

library(ggplot2)
library(dplyr)

set.seed(123)

df <- data.frame(
  sex = sample(c("Male", "Female"), size = 100, replace = TRUE),
  age = rnorm(100, mean = 25, sd = 10),
  expansion.factor = sample(12:40, size = 100, replace = TRUE)
)

You can create summaries

df %>%
    group_by(sex) %>%
    summarise(
        count = sum(expansion.factor),
        mean_age = (sum(age * expansion.factor))/sum(expansion.factor),
        # weighted.mean() from base R (stats) gives the same result
        mean_age2 = weighted.mean(age, expansion.factor)
    )

# A tibble: 2 x 4
  sex    count mean_age mean_age2
  <fct>  <int>    <dbl>     <dbl>
1 Female  1050     28.0      28.0
2 Male    1611     24.3      24.3
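
The same weighted summary can be applied directly to the data frame from the question, without expanding it. The following is a minimal base R sketch in case dplyr is not available; the names `survey`, `by_sex`, and `report` are introduced here for illustration only.

```r
# Data as posted in the question
survey <- data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male"),
  Age = c(19, 20, 34, 56, 45),
  ExpansionFactor = c(123456789, 31256789, 127896543, 251436978, 536294817)
)

# Split by Sex, then compute the weighted total and weighted mean per group
by_sex <- lapply(split(survey, survey$Sex), function(g) {
  data.frame(
    population = sum(g$ExpansionFactor),
    mean_age = weighted.mean(g$Age, g$ExpansionFactor)
  )
})

# One row per group; row names come from the group labels
report <- do.call(rbind, by_sex)
report

# The group totals add up to the row count the expanded data would have had
sum(report$population)
```

Only five rows are ever held in memory, yet `population` gives the expanded counts per group.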

Visualizations using ggplot2

df %>%
    ggplot(aes(x = age, weight = expansion.factor)) +
    geom_histogram(bins = 20)

[Image: histogram of age weighted by expansion.factor]
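
Weighted frequency tables for a report can be built in the same spirit. This is a sketch using base R's `xtabs()` on the simulated data above; the `age_band` column and its cut points are hypothetical and chosen only for illustration.

```r
set.seed(123)

# Same simulated data as in the answer
df <- data.frame(
  sex = sample(c("Male", "Female"), size = 100, replace = TRUE),
  age = rnorm(100, mean = 25, sd = 10),
  expansion.factor = sample(12:40, size = 100, replace = TRUE)
)

# Hypothetical age bands, left-closed: [-Inf,18), [18,30), [30,45), [45,Inf)
df$age_band <- cut(
  df$age,
  breaks = c(-Inf, 18, 30, 45, Inf),
  labels = c("<18", "18-29", "30-44", "45+"),
  right = FALSE
)

# Each cell sums expansion.factor instead of counting rows
weighted_counts <- xtabs(expansion.factor ~ sex + age_band, data = df)
weighted_counts

# Weighted share of each age band within each sex
round(prop.table(weighted_counts, margin = 1), 3)
```

Putting the weight on the left-hand side of the formula is what makes `xtabs()` sum weights rather than count rows.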

Johan Rosa