
I have an index with the following numbers: (5, 10, 15, 17). This index is generated from a large CSV file and corresponds to the order of the phrases in that file. Eventually I'd like to map these phrases back to the file, along with the new columns my loop generates.

Each index is associated with a phrase. My code separates the phrase and creates columns based on words in the phrase. I need to create another column in my data frame with the index number that corresponds to each phrase.

For example: 
    column 1          column 2            index
    phrase A            book                5
    phrase A            tree                5
    phrase B            tree                10

How would I achieve this result within my loop and make sure the index changes with every new input in column 1?

aa710
  • Possible duplicate of [Numbering rows within groups in a data frame](https://stackoverflow.com/questions/12925063/numbering-rows-within-groups-in-a-data-frame) – Reeza Jul 17 '19 at 17:27
  • `index = c(5, 10, 15, 17)`, `names(index) = c("phrase A", "phrase B", "phrase C", "phrase D")`. `your_data$index = index[your_data$column_1]`. – Gregor Thomas Jul 17 '19 at 17:33
  • I think this is not a dupe of Numbering within groups - OP wants the *same* index value for each group. – Gregor Thomas Jul 17 '19 at 17:34
  • @Gregor, Did you try it? Did it not work? Multiply by 5? – Reeza Jul 17 '19 at 17:37
  • @Reeza the OP wants the numbers to correspond to column 1, not column 2. I think they just want a join to a table describing how phrases map to indexes? (since they specify that they need specific indexes, not just any numbering) – Calum You Jul 17 '19 at 17:41
  • Are the equal phrases consecutive? – Rui Barradas Jul 17 '19 at 17:42
  • Yeah, @Gregor, you're correct, I'm wrong! Thanks :) – Reeza Jul 17 '19 at 17:43
  • One thing I should add: the phrases can repeat, and you can be extracting different things from them in column b. I just need to create a column whose index changes each time a new phrase is run in the loop. The index and the phrases are sequential. – aa710 Jul 17 '19 at 17:45
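
The lookup in Gregor Thomas's comment above can be sketched as follows. This is only a minimal illustration, assuming the phrase-to-index mapping is known up front; `your_data`, `column_1`, `column_2`, and the sample rows are hypothetical stand-ins for the real data frame.

```r
# Named vector: the names are the phrases, the values are their CSV positions.
index <- c("phrase A" = 5, "phrase B" = 10, "phrase C" = 15, "phrase D" = 17)

# Hypothetical data in the shape shown in the question.
your_data <- data.frame(
  column_1 = c("phrase A", "phrase A", "phrase B"),
  column_2 = c("book", "tree", "tree"),
  stringsAsFactors = FALSE
)

# Look each phrase up by name. as.character() guards against factor columns,
# which would otherwise be used as integer positions instead of names.
your_data$index <- unname(index[as.character(your_data$column_1)])
your_data
```

Because the lookup is by name, repeated or non-consecutive phrases still receive the same index.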

2 Answers


Something like this?

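# Return one index per row of DF: rows sharing a value in `group` share an index.
# Groups are numbered in order of first appearance; if `index_list` is supplied,
# the k-th distinct group gets index_list[k] instead of k.
index_by <- function(DF, group, index_list = NULL){
  g <- as.character(DF[[group]])
  i <- match(g, unique(g))
  if(is.null(index_list)) i else index_list[i]
}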

df1$index <- index_by(df1, "column1")
df1$index2 <- index_by(df1, "column1", c(5, 10, 15, 17))

df1
#    column1 index index2
#1  phrase 1     1      5
#2  phrase 1     1      5
#3  phrase 1     1      5
#4  phrase 1     1      5
#5  phrase 2     2     10
#6  phrase 2     2     10
#7  phrase 3     3     15
#8  phrase 3     3     15
#9  phrase 3     3     15
#10 phrase 4     4     17

Data creation code.

set.seed(1234)
df1 <- data.frame(column1 = paste("phrase", rep(1:4, sample(4))))
Rui Barradas
  • They are not going by increments of 5, the index is a list of completely random numbers based on the phrases – aa710 Jul 17 '19 at 17:56
  • Then why does it matter if it's 5 or not? – Reeza Jul 17 '19 at 17:57
  • @aa710 Then the same method would do it, just remove the multiply. – Rui Barradas Jul 17 '19 at 17:58
  • the index is generated based on where these phrases are in a large csv, eventually i need to map it back to that file so the index for each individual phrase matters here – aa710 Jul 17 '19 at 17:58
  • Then you need to explain your base case better, I feel this answers the question you asked, but the question you asked is not the problem you actually want to solve. – Reeza Jul 17 '19 at 17:59
  • This seems to work, but how would I modify this code to correspond to my index list? which would be in the same order the phrases appear? Instead of 1,2,3,4 – aa710 Jul 17 '19 at 18:04
  • @aa710 I have changed the function to have another argument, `index_list`. The function now defaults to consecutive numbers, pass the index list and it will use it instead. The example calls the function in both ways. – Rui Barradas Jul 17 '19 at 18:34
  • @aa710 the initial comment from Gregor answered this then, did you try his solution? – Reeza Jul 17 '19 at 21:09
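
For the loop described in the question, one option is to attach the index while each phrase is being processed. The sketch below is an assumption about how such a loop might look, not the asker's actual code: `phrases`, `extracted`, and `pieces` are hypothetical names, and `extracted` stands in for whatever the real word-extraction step produces.

```r
phrases <- c("phrase A", "phrase B")  # hypothetical phrases, in CSV order
index   <- c(5, 10)                   # CSV position of each phrase, same order

# Stand-in for the real extraction step: the words found in each phrase.
extracted <- list(c("book", "tree"), "tree")

pieces <- vector("list", length(phrases))
for (k in seq_along(phrases)) {
  # data.frame() recycles the single phrase and index across all extracted words.
  pieces[[k]] <- data.frame(
    column_1 = phrases[k],
    column_2 = extracted[[k]],
    index    = index[k],
    stringsAsFactors = FALSE
  )
}

# Stack the per-phrase pieces into one data frame.
result <- do.call(rbind, pieces)
result
```

Collecting the pieces in a list and binding once at the end avoids growing a data frame inside the loop.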

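You can use cur_group_id() from dplyr (part of the tidyverse); it replaces calling group_indices() with no arguments inside mutate(), which is deprecated as of dplyr 1.0.0. Here's an example that groups the mpg data set by manufacturer and multiplies each group number by 5.

library(tidyverse)

mpgGroupNbr <- mpg %>%
  arrange(manufacturer) %>%
  group_by(manufacturer) %>%
  mutate(groupNbr = cur_group_id() * 5)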

#check coding - max/min should be the same if coded correctly
mpgGroupNbr %>% 
  group_by(manufacturer) %>%
  summarize(max = max(groupNbr), min = min(groupNbr))

Results:

   manufacturer   max   min
   <chr>        <dbl> <dbl>
 1 audi             5     5
 2 chevrolet       10    10
 3 dodge           15    15
 4 ford            20    20
 5 honda           25    25
 6 hyundai         30    30
 7 jeep            35    35
 8 land rover      40    40
 9 lincoln         45    45
10 mercury         50    50
11 nissan          55    55
12 pontiac         60    60
13 subaru          65    65
14 toyota          70    70
15 volkswagen      75    75
Reeza
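
Since the indexes in the question (5, 10, 15, 17) are not a fixed multiple of the group number, the join that Calum You suggests in the comments may be closer to what is needed. The following is a minimal sketch under that assumption; `lookup`, `dat`, and the sample rows are hypothetical.

```r
library(dplyr)

# Hypothetical lookup table: one row per phrase with its position in the CSV.
lookup <- data.frame(
  column_1 = c("phrase A", "phrase B"),
  index    = c(5, 10),
  stringsAsFactors = FALSE
)

# Hypothetical loop output; "phrase Z" is deliberately missing from the lookup.
dat <- data.frame(
  column_1 = c("phrase A", "phrase A", "phrase B", "phrase Z"),
  column_2 = c("book", "tree", "tree", "leaf"),
  stringsAsFactors = FALSE
)

# left_join() keeps every row of dat; phrases without a match get NA.
joined <- left_join(dat, lookup, by = "column_1")
joined
```

An NA in the `index` column is a quick way to spot phrases that are missing from the lookup table.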