
I am new to both R and SO, and after figuring out quite a few things in my dataset, I am stuck on a new challenge. I am working on a .csv dataset and using R for data cleaning.

In the sample data below, the first column is labelled 'District/Subdistrict'. In that column, district names start with an underscore and subdistrict names are written as is. I need to create a new column at the end of my .csv (column number 5) with the label 'District', and I need to know how to use grepl and/or ifelse to populate that new column based on the first column, as follows. I will use the district name <_A> as an example.

The new column should contain the value <_A> for the row of District <_A> itself and for the rows of the subdistricts under it, such as <B>, <C> and <D> in the first column. Similarly, this should repeat for the other districts, such as the next District <_E> and its subdistricts.

I know how to load the data in R and set the working directory etc. I just need specific help with the code for this output that I am looking for. Even some sort of a generic form would be helpful. Apologies for the shortcomings in this question.

Sample data:

    District/Subdistrict  X   Y   Z
           _A             10  12  13
            B             8   40  15
            C             21  22  23
            D             32  40  21
           _E             24  94  97
            F             56  72  12
            G             35  23  12
            H             54  23  17

Expected output

             District/Subdistrict  X   Y   Z   District
                   _A             10  12  13     _A
                    B             8   40  15     _A
                    C             21  22  23     _A
                    D             32  40  21     _A
                   _E             24  94  97     _E
                    F             56  72  12     _E
                    G             35  23  12     _E
                    H             54  23  17     _E
Nik
  • Please make this a complete question by including sample data and expected output directly _in the question_. – Tim Biegeleisen Dec 29 '17 at 06:59
  • It's really not the mission of SO to start at the very beginning and teach you how to do data entry and text processing with R. Show what you do know how to do. We also don't accept questions with data in the form of pictures. That would imply that you believe we should redo your data entry. Sorry Chaley. – IRTFM Dec 29 '17 at 07:29
  • maybe you should try something like `sub(".*(_)","\\1",data[,1])` – Onyambu Dec 29 '17 at 08:22
  • @TimBiegeleisen I have included the same now. Apologies. – Nik Dec 29 '17 at 08:26

3 Answers


Maybe this tidyverse variant could help you.

    library(tidyverse)

I build a tibble with `tribble()` only to reproduce your data sample. Since you already have your data as a data.frame (I suppose), you can skip this step.

    df <- tibble::tribble(~`District/Subdistrict`, ~X,  ~Y,  ~Z,
                          "_A",                    10,  12,  13,
                          "B",                      8,  40,  15,
                          "C",                     21,  22,  23,
                          "D",                     32,  40,  21,
                          "_E",                    24,  94,  97,
                          "F",                     56,  72,  12,
                          "G",                     35,  23,  12,
                          "H",                     54,  23,  17)

Now the code that hopefully helps you:

    df %>%
      mutate(District = if_else(grepl("^_", `District/Subdistrict`), `District/Subdistrict`, NA_character_)) %>%
      fill(District) %>%
      as.data.frame()

    #       District/Subdistrict  X  Y  Z District
    # 1                       _A 10 12 13       _A
    # 2                        B  8 40 15       _A
    # 3                        C 21 22 23       _A
    # 4                        D 32 40 21       _A
    # 5                       _E 24 94 97       _E
    # 6                        F 56 72 12       _E
    # 7                        G 35 23 12       _E
    # 8                        H 54 23 17       _E
giovannotti
  • For `na.locf` see for example this answer [https://stackoverflow.com/a/7735681/9148552](https://stackoverflow.com/a/7735681/9148552) to a similar problem – giovannotti Dec 29 '17 at 07:46
  • You don't need `zoo` for this task if you already load `tidyverse`, just use `tidyr::fill(Dis)`. You can also get rid of that `.$`. Hence, `df %>% mutate(Dis = if_else(grepl("^_", Dis.SubDis), Dis.SubDis, NA_character_)) %>% fill(Dis)` is sufficient. – Tino Dec 29 '17 at 08:40
  • Thanks @tino for style correction, I adjusted my answer – giovannotti Dec 29 '17 at 09:17
  • @giovannotti Thank you!! But I am getting an error: `could not find function "mutate"`. I have installed tidyverse, dplyr and magrittr. Any thoughts? – Nik Dec 29 '17 at 10:09
  • Could you give us some more information about the error? Did you load the library `tidyverse`? You can explicitly call `dplyr::mutate()`, `tidyr::fill()` but there seems to be another problem. – giovannotti Dec 29 '17 at 10:23
  • @giovannotti the dataframe has 7406 elements in the 'District/Sub District' variable. How do I include all those in the `tibble::tribble()` command? I believe the problem is there. – Nik Jan 02 '18 at 09:35
  • You don't have to use `tribble`. It was just my way to create a data frame from your data sample above. I suppose your data is of class `data.frame`, in which case the code works fine. `mutate` and `fill` additionally transform the `data.frame` into a "tibble". If you want to get your data.frame back, just add `as.data.frame()` at the end. I will update my post. – giovannotti Jan 02 '18 at 09:52
  • I think it is because you forgot the backticks around District/Sub District. But is that the correct name of your column? By default (`check.names = TRUE`), `read.csv()` replaces slashes in column names with dots, so I guess it has to be another string. – giovannotti Jan 02 '18 at 10:46
  • @giovannotti yeah I was missing the initial backticks and I also figured that slashes don't work. Now the code works without any errors. But the 'District' column is still all empty. – Nik Jan 02 '18 at 11:12
  • Could you update your question with an original printed example of your data frame? For example, the output of `head()` or `dput(droplevels())`? I can't reproduce your empty column. – giovannotti Jan 02 '18 at 11:34
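
One possible cause of an all-empty `District` column, offered as a guess rather than a diagnosis: if the values in the file carry leading spaces (the padded sample in the question hints at this), `"^_"` never matches. A minimal base R sketch, assuming a hypothetical inline CSV `csv_text` in place of the real file and assuming the first row is always a district:

```r
# Hypothetical CSV text standing in for the real file; some values are padded.
csv_text <- "District/Subdistrict,X,Y,Z
 _A,10,12,13
 B,8,40,15
_E,24,94,97
 F,56,72,12"

# check.names = FALSE keeps the slash in the column name instead of a dot.
dat <- read.csv(text = csv_text, check.names = FALSE, stringsAsFactors = FALSE)

# Untrimmed, " _A" does not start with an underscore, so it is missed.
raw_match <- grepl("^_", dat[["District/Subdistrict"]])

# Trim the padding first, then flag district rows.
key <- trimws(dat[["District/Subdistrict"]])
is_district <- grepl("^_", key)

# Each row takes the most recent district name above it.
# Assumption: the first row is a district, so cumsum() never yields 0.
dat$District <- key[is_district][cumsum(is_district)]
print(dat)
```

If `raw_match` and `is_district` differ on the real data, stray whitespace is the likely culprit.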
0

Are you looking for this?

    # Repeat each district name once per row in its group
    rep(grep("^_", dat[, 1], value = TRUE), table(cumsum(grepl("^_", dat[, 1]))))
    # [1] "_A" "_A" "_A" "_A" "_E" "_E" "_E" "_E"

or even:

    cut(m <- cumsum(s <- grepl("^_", dat[, 1])), length(unique(m)), dat[s, 1])
    # [1] _A _A _A _A _E _E _E _E
    # Levels: _A _E
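
The `rep()` one-liner packs several steps together. Unpacked, it might look like the sketch below, which assumes `dat` holds the sample data with a character first column (the column name `key` and the intermediate variable names are only illustrative):

```r
# Sample data re-typed from the question; `key` is an illustrative column name.
dat <- data.frame(
  key = c("_A", "B", "C", "D", "_E", "F", "G", "H"),
  X   = c(10, 8, 21, 32, 24, 56, 35, 54),
  stringsAsFactors = FALSE
)

is_district <- grepl("^_", dat[, 1])        # TRUE on district rows only
group_id    <- cumsum(is_district)          # running count: 1 1 1 1 2 2 2 2
group_size  <- as.vector(table(group_id))   # rows per district group
districts   <- dat[is_district, 1]          # one name per group

# Repeat each district name once for every row in its group.
dat$District <- rep(districts, group_size)
print(dat)
```

Assigning the result to `dat$District` stores it as a new column instead of only printing it.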
Onyambu
0

Here is another idea, via `ave`:

    with(df, ave(District.Subdistrict,
                 cumsum(grepl('_', District.Subdistrict)),
                 FUN = function(i) head(i, 1)))
    #[1] _A _A _A _A _E _E _E _E
    #Levels: _A _E B C D F G H
Sotos
  • That worked!!! Thank you so much. However, the new column was created as a separate data frame. Is there a way to have it in the same data frame? – Nik Jan 03 '18 at 10:47
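
One way to keep the result in the same data frame is to assign it to a new column rather than printing it. A minimal sketch, assuming the data frame is called `df`, that its first column is named `District.Subdistrict` (the name `read.csv` produces by default), and that the column is stored as character; the sample values are re-typed from the question:

```r
# Sample data from the question, with the default read.csv-style column name.
df <- data.frame(
  District.Subdistrict = c("_A", "B", "C", "D", "_E", "F", "G", "H"),
  X = c(10, 8, 21, 32, 24, 56, 35, 54),
  Y = c(12, 40, 22, 40, 94, 72, 23, 23),
  Z = c(13, 15, 23, 21, 97, 12, 12, 17),
  stringsAsFactors = FALSE
)

# Same ave() idea, but stored as a fifth column of df.
# "^_" only matches names that start with an underscore.
df$District <- with(df, ave(District.Subdistrict,
                            cumsum(grepl("^_", District.Subdistrict)),
                            FUN = function(i) head(i, 1)))
print(df)
```

From there, `write.csv(df, "output.csv", row.names = FALSE)` would save the extended table, where `output.csv` is a placeholder file name.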