Visualization of missing/available data originating from different sources

Question

Here’s the situation:

I have two datasets containing the monthly prices of different financial products between 02/2013 and 09/2019, where available. Each dataset has 47 observations (financial product IDs) and 80 variables (months).

Dataset 1 contains less data for this period, but the quality of the data is better.

The first part of my problem (which I have resolved) was to create a new dataframe that contains all available data from Dataset 1, supplemented with data from Dataset 2 wherever the value is absent from Dataset 1.
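
A minimal sketch of one way to do this fill step (not necessarily the approach actually used here), assuming both dataframes have identical row names and column names in the same order. The toy values are a small subset of the samples given further down:

```r
# Toy stand-ins for the two datasets: same product IDs (rows), same months (columns).
ids <- c("75884RAT0", "62943WAA7", "62943WAB5")
df1 <- data.frame(`201902` = c(NA, NA, 98.1906626506024),
                  `201903` = c(101.376, NA, 100.10447592068),
                  row.names = ids, check.names = FALSE)
df2 <- data.frame(`201902` = c(97.289, 99.9968, 101.194),
                  `201903` = c(96.78, 99.9968, 101.315),
                  row.names = ids, check.names = FALSE)

# Start from Dataset 1 and fill only its missing cells with Dataset 2.
combined <- df1
gaps <- is.na(df1)            # logical matrix, TRUE where Dataset 1 has no value
combined[gaps] <- df2[gaps]   # values from Dataset 1 are never overwritten

combined
```

Keeping the `gaps` matrix around is useful, because it records exactly which cells were taken from Dataset 2.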

Now, I am trying to illustrate this process with ggplot.

I have plotted the missing data separately for each dataset using

(1) missingness maps from the Amelia package

library(Amelia)
missmap(df, rank.order = FALSE, col = c("white", "black"))

(2) md.pattern from the mice package

library(mice)
md.pattern(df)

I would like to plot my final dataset, which contains both types of data, in one of these formats, with a color code that clearly shows which values come from Dataset 1 and which were added from Dataset 2. Is this possible?

Here are subsets of both datasets:

dput(df1)
structure(list(`201811` = c(NA, NA, NA, NA, 95.5237185244587, 
NA, 97.5075873015873, NA, NA, NA), `201812` = c(NA, NA, NA, NA, 
95.2207352941176, NA, 98.6600228310502, NA, NA, NA), `201901` = c(NA, 
NA, NA, NA, 93.1981693949331, NA, 100.441459234609, NA, NA, 98.789
), `201902` = c(NA, NA, NA, NA, 98.1906626506024, NA, 100.144885961747, 
NA, NA, 99.029), `201903` = c(NA, 101.376, NA, NA, 100.10447592068, 
NA, 100.95874067937, NA, 103.374571428571, 99.743), `201904` = c(NA, 
101.966785714286, NA, NA, 101.686565217391, NA, 100.711654559226, 
NA, 103.411, 99.517)), row.names = c("929043AH0", "75884RAT0", 
"62943WAA7", "88104LAA1", "62943WAB5", "268317AS3", "037833BU3", 
"88104LAB9", "25389JAL0", "865622BY9"), class = "data.frame")

dput(df2)
structure(list(`201811` = c(97.069, 93.375, 99.8809, 94.576, 
99.849, 96.551, 93.5, 94.8075, 88.8982, 92.8731), `201812` = c(97.638, 
93.75, 99.9679, 94.613, 99.831, 96.692, 93.375, 94.8904, 89.1294, 
93.293), `201901` = c(98.506, 94.924, 99.9968, 96.488, 100.962, 
97.371, 93.75, 97.6666, 91.3518, 98.2993), `201902` = c(100.026, 
97.289, 99.9968, 96.92, 101.194, 97.274, 97.125, 97.8991, 93.3958, 
97.7391), `201903` = c(99.779, 96.78, 99.9968, 96.919, 101.315, 
97.691, 97.7515, 98.1629, 93.0283, 97.8553), `201904` = c(100.665, 
98.971, 99.9968, 98.289, 102.869, 98.402, 98.2492, 99.4818, 95.7858, 
100.6429)), row.names = c("929043AH0", "75884RAT0", "62943WAA7", 
"88104LAA1", "62943WAB5", "268317AS3", "037833BU3", "88104LAB9", 
"25389JAL0", "865622BY9"), class = "data.frame")
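
To make the desired output concrete, here is a rough sketch of the kind of color-coded map meant above, drawn with `geom_tile` from ggplot2 rather than with `missmap` or `md.pattern`. It assumes both dataframes share the same row and column names. The toy data is a subset of the samples above, except for one `NA` in `df2` that is hypothetical and added only so that the "Missing" category appears. The names `src` and `long` are illustrative:

```r
library(ggplot2)

ids <- c("75884RAT0", "62943WAA7", "62943WAB5")
df1 <- data.frame(`201902` = c(NA, NA, 98.1906626506024),
                  `201903` = c(101.376, NA, 100.10447592068),
                  row.names = ids, check.names = FALSE)
# The NA below is hypothetical (the real sample of df2 has no missing values).
df2 <- data.frame(`201902` = c(97.289, NA, 101.194),
                  `201903` = c(96.78, 99.9968, 101.315),
                  row.names = ids, check.names = FALSE)

# Label every cell with the source its value would come from.
src <- ifelse(!is.na(df1), "Dataset 1",
              ifelse(!is.na(df2), "Dataset 2", "Missing"))

# Reshape to long format: one row per (product ID, month) cell.
# as.vector() reads the matrix column by column, hence times/each below.
long <- data.frame(
  id     = rep(rownames(df1), times = ncol(df1)),
  month  = rep(colnames(df1), each = nrow(df1)),
  source = as.vector(src)
)

p <- ggplot(long, aes(x = month, y = id, fill = source)) +
  geom_tile(colour = "grey80") +
  scale_fill_manual(values = c("Dataset 1" = "black",
                               "Dataset 2" = "steelblue",
                               "Missing"   = "white")) +
  labs(x = "Month", y = "Product ID", fill = "Source")
p
```

With 80 months on the x axis, the tick labels would probably need rotating or thinning.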

Hi! Certainly this is possible, but you will likely get faster and better answers if you provide a small data sample as per [how to ask](https://stackoverflow.com/help/how-to-ask). You can find some tips in the answers to [how to make a reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), especially using `dput` and `head`. — Calum You, Sep 04 '19 at 19:40
Thanks for that feedback! I've edited my question and integrated subsets of both dataframes. — Romain Berrou, Sep 05 '19 at 09:52
I can't find a way to put my column names in code in `dput` on here, since they already contain quotes... — Romain Berrou, Sep 05 '19 at 12:44
