1

I have a CSV file like

Identity,AdvertiserName,CampaignName,AdGroupName,Keyword,DestURL,KeystoneKW,,CampaignDuplicate,AdGroupDuplicate,CampaignLocation,,,,,,,,,
666,Bro Pest Control,cat|home & garden|pest control,kw|entry,Bro Pest Control,http://www.ci.com/profile/66/ab/brrd_pest_control.html,Pest Control,,NO,NO,"Ablle,Louna,United States",,,,,,,,,
447,Dist Tire Ctr Inc,cat|automotive sales & services|automotive repair,kw|entry,DisTire Ctr Inc,http://www.cit.com/profile/44/abbeville_la/discoutire_ctr_inc.html,Autepair,,NO,NO,"Abblle,Louana,United States",,,,,,,,,
6665,Best Control,geo|la|abbe la area,home & garden|pest control,Br Pest Control,http://www.cit.com/profile/66/abbee_la/broud_pest_control.html,Pest Control,,NO,NO,"A,Louisiana,United States",,,,,,,,,

My desired output is

 Identity,AdvertiserName,CampaignName,AdGroupName,Keyword,DestURL,KeystoneKW,,CampaignDuplicate,AdGroupDuplicate,CampaignLocation
666,Broud Pest Control,cat|home & garden|pest control,kw|entry,Bssad Pest Control,http://www.cit.com/profile/666/abbeville_la/brrd_pest_control.html,Pest Control,NO,NO,"Abbe,Louiana,United States"
44,DiscTire Ctr Inc,cat|automotive sales & services|automotive repair,kw|entry,Discount Tire Ctr Inc,http://www.cit.com/profile/44/ab/discouctr_inc.html,Automotive Repair,NO,NO,"Abbe,Loua,United States"

The piece of code I am using is

mydf <- read.csv("C:/Users/Administrator/Downloads/FinalLocationList1.csv", header=FALSE, skip=1)
d <- setNames(mydf[,sapply(mydf, function(x) all(!is.na(x)))],names(n))
z <- mydf <- Filter(function(x)!all(is.na(x)), mydf)

Credit - Thomas

But the above is not taking care of the Header problem? How to solve it? New to R. Any help is appreciated.

Edit : Output of dput(mydf)

structure(list(V1 = c(666L, 447L, 6665L), V2 = structure(c(2L, 
3L, 1L), .Label = c("Best Control", "Bro Pest Control", "Dist Tire Ctr Inc"
), class = "factor"), V3 = structure(c(2L, 1L, 3L), .Label = c("cat|automotive sales &   services|automotive repair", 
"cat|home & garden|pest control", "geo|la|abbe la area"), class = "factor"), 
V4 = structure(c(2L, 2L, 1L), .Label = c("home & garden|pest control", 
"kw|entry"), class = "factor"), V5 = structure(c(2L, 3L, 
1L), .Label = c("Br Pest Control", "Bro Pest Control", "DisTire Ctr Inc"
), class = "factor"), V6 = structure(1:3, .Label = c("http://www.ci.com/profile/66/ab /brrd_pest_control.html", 
"http://www.cit.com/profile/44/abbeville_la/discoutire_ctr_inc.html", 
"http://www.cit.com/profile/66/abbee_la/broud_pest_control.html"
), class = "factor"), V7 = structure(c(2L, 1L, 2L), .Label = c("Autepair", 
"Pest Control"), class = "factor"), V8 = c(NA, NA, NA), V9 = structure(c(1L, 
1L, 1L), .Label = "NO", class = "factor"), V10 = structure(c(1L, 
1L, 1L), .Label = "NO", class = "factor"), V11 = structure(c(3L, 
2L, 1L), .Label = c("A,Louisiana,United States", "Abblle,Louana,United States", 
"Ablle,Louna,United States"), class = "factor"), V12 = c(NA, 
NA, NA), V13 = c(NA, NA, NA), V14 = c(NA, NA, NA), V15 = c(NA, 
NA, NA), V16 = c(NA, NA, NA), V17 = c(NA, NA, NA), V18 = c(NA, 
NA, NA), V19 = c(NA, NA, NA), V20 = c(NA, NA, NA)), .Names = c("V1", 
"V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", 
"V12", "V13", "V14", "V15", "V16", "V17", "V18", "V19", "V20"
), class = "data.frame", row.names = c(NA, -3L))

Error

Error in setNames(mydf[, sapply(mydf, function(x) all(!is.na(x)))], names(n)) : 
'names' attribute [20] must be the same length as the vector [10]
user3188390
  • 603
  • 2
  • 11
  • 19
  • you are in right track, after reding `mydf`, i guess now your mydf is 8 col, so delete those 3 cols by `mydf<-mydf[,-c(2,5,7)]` and just use `names(mydf)<-c("Identity",......)` – Ananta Feb 28 '14 at 21:22
  • It would be easier to replace ",," with "," using a text editor prior to reading the file with R. If you can use sed it would even be faster. – Roland Feb 28 '14 at 21:25
  • possible duplicate of [How to remove empty columns in R?](http://stackoverflow.com/questions/22104962/how-to-remove-empty-columns-in-r) – Ricardo Oliveros-Ramos Feb 28 '14 at 21:48
  • It's fairly easy to see that you want `header=TRUE` since the first column is 'numeric'. – IRTFM Feb 28 '14 at 22:53
  • It is true that it is the same question but none of the the answers described the way to do this as part of the input operation. – IRTFM Feb 28 '14 at 22:55

2 Answers2

2

Use colClasses="NULL" for the columns you want to drop ... the last nine in this case, so rep("NULL",9)

tx <- 'Identity,AdvertiserName,CampaignName,AdGroupName,Keyword,DestURL,KeystoneKW,,CampaignDuplicate,AdGroupDuplicate,CampaignLocation,,,,,,,,,
666,Broud Pest Control,cat|home & garden|pest control,kw|entry,Bssad Pest Control,http://www.cit.com/profile/666/abbeville_la/brrd_pest_control.html,Pest Control,,NO,NO,"Abbe,Louiana,United States",,,,,,,,,
44,DiscTire Ctr Inc,cat|automotive sales & services|automotive repair,kw|entry,Discount Tire Ctr Inc,http://www.cit.com/profile/44/ab/discouctr_inc.html,Automotive Repair,,NO,NO,"Abbe,Loua,United States",,,,,,,,,'

df <- read.table(text=tx, sep=",", 
                 colClasses=c("numeric", rep("character",10), rep("NULL",9)), 
                 header=TRUE)

> str(df)
'data.frame':   2 obs. of  11 variables:
 $ Identity         : num  666 44
 $ AdvertiserName   : chr  "Broud Pest Control" "DiscTire Ctr Inc"
 $ CampaignName     : chr  "cat|home & garden|pest control" "cat|automotive sales & services|automotive repair"
 $ AdGroupName      : chr  "kw|entry" "kw|entry"
 $ Keyword          : chr  "Bssad Pest Control" "Discount Tire Ctr Inc"
 $ DestURL          : chr  "http://www.cit.com/profile/666/abbeville_la/brrd_pest_control.html" "http://www.cit.com/profile/44/ab/discouctr_inc.html"
 $ KeystoneKW       : chr  "Pest Control" "Automotive Repair"
 $ X                : chr  "" ""
 $ CampaignDuplicate: chr  "NO" "NO"
 $ AdGroupDuplicate : chr  "NO" "NO"
 $ CampaignLocation : chr  "Abbe,Louiana,United States" "Abbe,Loua,United States"
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • The result has 11 columns instead of 10 though I did remove it. Thanks for the answer I have been searching for a while. got to learn something new. – user3188390 Feb 28 '14 at 23:10
  • 1
    Yeah. I didn't notice until later that you had an embedded bpty column but that could be handled with `c("numeric", rep("character",6), "NULL, rep("character", 3), rep("NULL",9)), ` – IRTFM Feb 28 '14 at 23:14
  • +1, Nice trick. Documented, though I didn't notice it when scanning for an obvious argument to do just this. – BrodieG Feb 28 '14 at 23:20
  • @BrodieG,@IShould c("numeric", rep("character",6), "NULL, rep("character", 3), rep("NULL",9)) seems so there is an error should it be like c("numeric", rep("character",6), rep("NULL",7),rep("character", 3), rep("NULL",9)) - but then it shows error - more columns than column names. – user3188390 Feb 28 '14 at 23:24
  • Missing double quote after the first NULL. `c("numeric", rep("character",6), "NULL", rep("character", 3), rep("NULL",9)),` – IRTFM Mar 01 '14 at 00:09
0

You can try:

    setNames(
      Filter(function(x) !all(is.na(x)), mydf), 
      names(mydf)[-grep("^X(\\.[0-9]+)?$", names(mydf))]
    )

Produces:

  Identity Number Data Result Add
1        1      4   55     92  62
2        3      7   43     12  74
3        7      3   58     52  64
4        0      6   10     22  96
5        3      8   13     92  22

Filter will keep all columns that are not all NA. Then the grep piece relies on the names produced by read.CSV for blank columns (X, X.1, etc.) to filter out the wrong names. This should work generically.


EDIT: running with updated CSV produces:

> str(setNames(Filter(function(x) !all(is.na(x)), mydf), names(mydf)[-grep("^X(\\.[0-9]+)?", names(mydf))]))
'data.frame': 2 obs. of  10 variables:
 $ Identity         : int  666 44
 $ AdvertiserName   : Factor w/ 2 levels "Broud Pest Control",..: 1 2
 $ CampaignName     : Factor w/ 2 levels "cat|automotive sales & services|automotive repair",..: 2 1
 $ AdGroupName      : Factor w/ 1 level "kw|entry": 1 1
 $ Keyword          : Factor w/ 2 levels "Bssad Pest Control",..: 1 2
 $ DestURL          : Factor w/ 2 levels "http://www.cit.com/profile/44/ab/discouctr_inc.html",..: 2 1
 $ KeystoneKW       : Factor w/ 2 levels "Automotive Repair",..: 2 1
 $ CampaignDuplicate: Factor w/ 1 level "NO": 1 1
 $ AdGroupDuplicate : Factor w/ 1 level "NO": 1 1
 $ CampaignLocation : Factor w/ 2 levels "Abbe,Loua,United States",..: 2 1
BrodieG
  • 51,669
  • 9
  • 93
  • 146
  • Edited my question as well to the actual csv file, please take a look at it and in the meanwhile I will try the new edit as well. – user3188390 Feb 28 '14 at 21:49
  • @user3188390, the code seems to work with your file, at least for me. – BrodieG Feb 28 '14 at 21:51
  • The output is stating there are 19 columns instead of just 10. why? I am using the code like : mydf <- read.csv("C:/Users/Administrator/Downloads/FinalLocationList.csv",header=TRUE); z <- setNames(Filter(function(x) !all(is.na(x)), mydf), names(mydf)[-grep("^X(\\.[0-9]+)?", names(mydf))]); – user3188390 Feb 28 '14 at 22:02
  • 1
    Can you `dput(mydf)` and update your question with it? Hard to know for sure without working directly with your data. – BrodieG Feb 28 '14 at 22:38
  • Changed the input and as requested the output of dput(mydf) along with the error – user3188390 Feb 28 '14 at 22:53
  • Got the answer, I appreciate your help. I have also posted the updated question for your reference. – user3188390 Feb 28 '14 at 23:14