1

There is a given dummy dataset. I am trying to import it into the SAS environment and i also tried it in RStudio as well. But i was unable to grab the correct output. the dataset provided is as follows:

1    "name, age, salary, zipcode"
2    "A, 1, 100, 10010 B, 2, 200, 10011 C, 3, 300, 10012 D, 4, 400, 10014"

i copied the data from a CSV file. The line 1 was in the First cell of the CSV and Line 2 was in the second cell just below the first cell.

PS : 10010 is the zipcode where as B is the Name that is B is the starting of next observation and similiarly C is the next observation and D is the next observation.

The Desired output is:

Name   Age   Salary   Zipcode
 A      1     100      10010
 B      2     200      10011
 C      3     300      10012
 D      4     400      10014

i am stuck with this problem since a day ago. I am just able to move around the dataset in order to move the observations. Moreover the double qouted values are making this problem very complex. A working Solution on SAS or R would work great.

EDIT: The following code has been tried by me in SAS. Thanks SMW for pointing. :)

data new;
length Name $2 Age 8. Salary 8. Zipcode 8.;
infile 'book1.csv' dsd dlm = ',' firstobs = 2 LRECL =17 ;
input Name $ Age Salary Zipcode @@;
run;

Regards

desmond.carros
  • 372
  • 2
  • 21
  • Believe you need either the trailing "@" or you can use "\" to control what records go to what observation. Once you post an attempt of code I will know for sure, but not 100% certain of your question at this point. – SMW Jan 19 '16 at 12:53
  • @SMW Added the Edit in the query. Thanks much. Please let me know for any Catch in the code. – desmond.carros Jan 19 '16 at 13:38
  • Does the file literally have the quotes? Does it really have all of the data on the second line? You could probably get SAS to read it by telling SAS that space, comma and quotes are delimiters `DLM=' ,"'`. But if any of the values are blank or if the names have embedded spaces then it will get confused. – Tom Jan 19 '16 at 13:59
  • @Tom Yes the file literally have the quotes. and all the data is on the second line. But the space delimiter is at the end of each OBSERVATION. Not at every single value. That's the reason the above code didn't worked efficiently. Thanks for the reply. – desmond.carros Jan 19 '16 at 14:05

2 Answers2

3

Import the column names and prepare it

nam <- read.table("csv.csv", header=F, nrow=1, stringsAsFactors=F)
nam <- gsub(",", "", nam)
nam <- nam[2]                             # resolve " " in original data 
nam <- strsplit(nam, split=" ")           #

Capitalize the colnames

library(Hmisc)                   # edited
nam <- capitalize(nam)           # edited

Import the remaining file

dd <- read.table("csv.csv", skip=1, sep="", stringsAsFactors=F)

Reorder the file, extract first column

col_1 <- gsub(",", "", as.vector(dd[seq(from=1, to=16,by=4)])[1,])

Extract the remaining columns

col_all <- as.data.frame(t(matrix(dd, ncol=4)))

Clean commas

col_all <- apply(apply(col_all[-1], 2, gsub, patt=",", replace=""), 2, as.numeric)

and set the final object

data <- as.data.frame(cbind(col_1, col_all))
names(data) <- nam
data  # ouput

  Name Age Salary Zipcode
1    A   1    100   10010
2    B   2    200   10011
3    C   3    300   10012
4    D   4    400   10014

Edited: Capitalize only with base functions. Avoid library(Hmisc)and the following code line.

nam <- unlist(lapply(nam, function(x) {                    
   paste(toupper(substring(x, 1, 1)), substring(x, 2, max(nchar(nam))), sep="")
}))
PereG
  • 1,796
  • 2
  • 22
  • 23
  • Thanks for the answer :) But can it be done without using any external packages? i want to know if it could be done with standard available packages of RStudio. – desmond.carros Jan 20 '16 at 05:12
  • Included a base version. – PereG Jan 20 '16 at 07:44
  • Warning message: In read.table("C:/Users/Abhi Gupta/Desktop/file1.csv", skip = 1, : incomplete final line found by readTableHeader on 'C:/Users/Abhi Gupta/Desktop/file1.csv' Found this error in the console. – desmond.carros Jan 20 '16 at 08:16
  • I think it isn't a problem associated with this question. Hope it helps http://stackoverflow.com/questions/5990654/incomplete-final-line-warning-when-trying-to-read-a-csv-file-into-r – PereG Jan 20 '16 at 08:35
  • Getting some more errors too: 1. Error in `[.data.frame`(dd, seq(from = 1, to = 16, by = 4)) : undefined columns selected 2. > col_all <- apply(apply(col_all[-1], 2, gsub, patt=",", replace=""), 2, as.numeric) Error in apply(apply(col_all[-1], 2, gsub, patt = ",", replace = ""), : dim(X) must have a positive length – desmond.carros Jan 20 '16 at 08:43
  • can you indicate the result of `class (dd)` and `dim(dd)`? Then, have you used `as.vector (dd ...)`? – PereG Jan 20 '16 at 08:56
  • > class(dd) [1] "data.frame" > dim(dd) [1] 1 1 @PereG here is the dimension – desmond.carros Jan 20 '16 at 10:20
  • In your example, the dim(dd) must be 1 16, because read.table separated by whitespace. Check out your original file to see if it is. Only using a sample of your file could advance. – PereG Jan 20 '16 at 10:37
  • Try aditional argument in read.table()` task and a new clean task: `dd <- read.table("csv.csv", skip=1, sep="",stringsAsFactors=F, quote="")` and `dd <- gsub("[[:punct:]]", "", dd)` – PereG Jan 20 '16 at 10:44
  • The original file was imported directly and is not altered. I am not sure of what the error is about. – desmond.carros Jan 20 '16 at 10:48
  • > col_1 <- gsub(",", "", as.vector(dd[seq(from=1, to=16,by=4)])[1,]) Error in as.vector(dd[seq(from = 1, to = 16, by = 4)])[1, ] : incorrect number of dimensions – desmond.carros Jan 20 '16 at 10:50
  • I use to=16 because the example dataset was. Replace with `col_1 <- gsub(",", "", as.vector(dd[seq(from=1, to=length(dd),by=4)]))` – PereG Jan 20 '16 at 11:08
0

You can ask SAS to treat comma, blank and quote as delimiters and use trail @@ to allow it to read multiple observations from one line.

Let's build your example data file.

filename example temp;
data _null_;
  file example;
  put '"name, age, salary, zipcode"'
    / '"A, 1, 100, 10010 B, 2, 200, 10011 C, 3, 300, 10012 D, 4, 400, 10014"'
 ;
run;

And now read it.

data want;
  length Name $2 Age 8. Salary 8. Zipcode 8.;
  infile example dlm = ' ,"' firstobs = 2 ;
  input Name $ Age Salary Zipcode @@;
run;
Tom
  • 47,574
  • 2
  • 16
  • 29
  • That's a perfect explanation. But what if the data is very large then we have to import the data into the SAS environment. If import function can be used in the solution then it could be a better explanation. – desmond.carros Jan 20 '16 at 05:17
  • Thanks so much @Tom this logic worked perfectly with import command too. the main logic was to identify various delimiters in the given data. Thanks a ton mate. – desmond.carros Jan 20 '16 at 05:25
  • what if the dataset contains the missing values? for instance let me provide you with a dataset "name, age, salary, zipcode" "A, 1, 100, B, 2, 200, 10011 C, 3, 300, 10012 , 4, 400, 10014 E, 5, , 10015" – desmond.carros Jan 20 '16 at 08:21
  • It will not work if the missing values are represented by spaces. If the missing values are represented by a `.` then it will work. – Tom Jan 20 '16 at 14:25
  • If you want to use PROC IMPORT to convert your files than you need to make sure that they are in a cleaner format. I would go back to source and find out why the values got quoted as if they were a single string and why the end of line characters between the data rows are missing. – Tom Jan 20 '16 at 14:27
  • Thanks much @Tom it was of great help. – desmond.carros Jan 21 '16 at 05:10