0

new to R and trying to make the most of it. This is a task that needs to be done numerous times per week.

At work we get a SAS-code text file that contains the following data:

            1,2,3,201,202,203 = "Screening"
            101,102,301,302,404,405,1001= "Cycle 1 Day 1"
            1002 = "Cycle 1 Day 2"
            1003 = "Cycle 1 Day 3"
            103,104,303,304,407,408,409,410 = "Cycle 1 Day 8"
            105,106,305,306,412,413 = "Cycle 1 Day 15"
            107,108,307,308,414,415,416,417,1022= "Cycle 1 Day 22"
            1023 = "Cycle 1 Day 23"
            1024 = "Cycle 1 Day 24"
            109,110,309,310,418,419,420,421,2001 = "Cycle 2 Day 1"
            2002= "Cycle 2 Day 2"
            2003= "Cycle 2 Day 3"
            111,112,422,423 = "Cycle 2 Day 8"
            113,114,311,312,424,425 = "Cycle 2 Day 15"
            115,116,426,427= "Cycle 2 Day 22"
            117,118,119,313,314,315 = "Cycle 2 End of Cycle"
            120,121,316,317,430,431 = "Cycle 3 Day 1"
            122,123,432,433 = "Cycle 3 Day 8"
            124,125,318,319,434,435 = "Cycle 3 Day 15"
            126,127,436,437 = "Cycle 3 Day 22"
            128,129,320,321,438,439 = "Cycle 4 Day 1"

I also have an Excel file with the following content:

Visit No    Vis Label
1   Screening Day -14 to -1
2   Screening Day -14 to -1
3   Screening Day -14 to -1
101 Cycle 1 Day 1 to 3
102 Cycle 1 Day 1 to 3
103 Cycle 1 Day 8     
104 Cycle 1 Day 8     
105 Cycle 1 Day 15     
106 Cycle 1 Day 15     
107 Cycle 1 Day 22     
108 Cycle 1 Day 22     
109 Cycle 2 Day 1     
110 Cycle 2 Day 1     
111 Cycle 2 Day 8     
112 Cycle 2 Day 8     
113 Cycle 2 Day 15     
114 Cycle 2 Day 15     
115 Cycle 2 Day 22     
116 Cycle 2 Day 22     
117 Cycle 2 End of Cycle
118 Cycle 2 End of Cycle
119 Cycle 2 End of Cycle

Now I have to compare the SAS-code txt file to the Excel one, and jot down which ones are missing from the SAS file.

I tried reading the SAS txt file, which became filled with white space and was able to remove that and get the content by this:

d <- read.delim("testSAS.txt", sep = ":", strip.white = TRUE, skip = 2, header = FALSE)

which results in this:

                               V1             V2
1               1,2,3,201,202,203      Screening
2    101,102,301,302,404,405,1001  Cycle 1 Day 1
3                            1002  Cycle 1 Day 2
4                            1003  Cycle 1 Day 3
5 103,104,303,304,407,408,409,410  Cycle 1 Day 8
6         105,106,305,306,412,413 Cycle 1 Day 15

Now I want to separate the V1 values such that each one makes a new row with the same value attached to it, as shown below:

1   Screening
2   Screening
3   Screening
201 Screening
202 Screening
203 Screening

I have used the following code to make it more malleable but it seems it returns a deformed list:

df<- data.frame(matrix(unlist(d), nrow = 61, byrow = T),stringsAsFactors = FALSE)

And now want to compare it to the Excel file in order to finally generate either a txt or xlsx saying which V1 values were not present.

Is there a faster way to do this than by manipulating the SAS txt file? How should I proceed with this? is it faster doing it from the excel to the txt or vice versa?

Any advice is welcomed !

ewk
  • 17
  • 1
  • 8
  • You should [reshape](https://stackoverflow.com/questions/2185252/reshaping-data-frame-from-wide-to-long-format) your SAS data, then do a `anti-join`. That will return rows from `x` that are absent from `y`. See `?join` on how. Read in your `xlsx` tables with `openxlsx::read.xlsx`. – Anonymous coward Oct 31 '18 at 17:50
  • If you need more assistance, be sure to post a [reproducible example](https://stackoverflow.com/a/5963610/2359523). – Anonymous coward Oct 31 '18 at 17:56
  • Good question. You should check out `haven` (which imports SAS data into R in a 'tidy' way) and `readxl` (which does the same for Excel). Then look up the `tidyverse`. Once you have that, I think you'll have all that you need (for this and much more). Good luck – p0bs Oct 31 '18 at 18:27
  • Your SAS code looks like part of a SAS format definition. Do you have access to SAS? You could just run the code to define the format and then convert the definition to a dataset. – Tom Oct 31 '18 at 19:07
  • @Tom, I do not have access to SAS as of yet, probably in a month. This is the script that a colleague is using to perform a quality check with the excel file. You're saying to run the SAS format def, in R? – ewk Oct 31 '18 at 19:11
  • If you have complete SAS code to define a format then you can run that code in SAS to actually create the format catalog entry. You could then run more SAS code to convert that format into a dataset which you could either read with haven or export to a text file that would be easier to read than code. – Tom Oct 31 '18 at 19:30

1 Answers1

0

If your file is of the format .sas7bdat, then use the "haven" package to read the sas file like this. For reading xlsx file, you can use openxlsx file which wont have any Java dependencies so that should work fine. See example on how to do it below:

install.packages("haven")
library(haven)
sasfile <- haven::read_sas("path/to/sas/file.sas7bdat")

install.packages("openxlsx")
library(openxlsx)
xlsxfile <- openxlsx::read.xlsx(xlsxFile = "path/to/xlsx/file.xlsx")

Editing the original answer, now that I understand you wanted to melt it.

For melting your data the way you want, I would do something like this

install.packages("stringr")
install.packages("plyr")
library(stringr)
library(plyr)
new_df <-plyr::ddply(df, .(V2),function(x) data.frame(V1=unlist(stringr::str_split(df$V1, pattern = ","))))

To rearrange the columns as you wanted, you can do

new_df <- new_df[,c("V1","V2")]

Then use the "dataCompareR" which is custom build to compare 2 datasets in R. (The package is actually build in order to replicate sas proc compare procedure, so you can play with different arguments to see what you want to get)

install.packages("dataCompareR")
dataCompareR::rCompare(dfA = sasfile,dfB = xlsxfile)

Note: I did not test the code so let me know in the comments if it works.

Pang
  • 9,564
  • 146
  • 81
  • 122
Gompu
  • 415
  • 1
  • 6
  • 21
  • hey! It is solely in a txt file, I have no access to SAS files. Just like @Tom said above! – ewk Oct 31 '18 at 19:11