
I am trying to split up a data frame whose columns hold a list (vector) in each row. There are 3 columns: jobId (which is unique), skills, and skillTypeId.

I am hoping to unpack the vectors in "skills" and "skillTypeId" into one row per element, keeping each skill matched with its skillTypeId. For example:

original, and after

df <- structure(list(job.Id = "A", skill = list(c("microsoft excel", 
"product development")), skillTypeld = list(c(2, 2))), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -1L))

Currently, I separate them by creating one data frame for "skills" and another for "skillTypeId". The "skills" data frame contains just jobId and skills; the "skillTypeId" data frame contains just jobId and skillTypeId. I then use separate_rows on each, and finally use cbind to merge the two data frames back together.

However, one problem arises: the two data frames end up with a different number of rows (they differ by 100+ rows out of the million rows), and I have too much data to troubleshoot which rows went wrong.

I understand that my approach is rather manual, so I am hoping to get some help in making it less manual and, most importantly, in not losing any rows.
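One way to narrow down which rows went wrong, without scanning a million rows by eye, is to compare the per-row lengths of the two list columns. This is a minimal sketch in base R on made-up data: the `jobs` object and its mismatched row "B" are hypothetical, and the column names follow the `dput` output above.

```r
# Toy data shaped like the question's df; row "B" is a hypothetical mismatch
# (two skills but only one skill type).
jobs <- data.frame(job.Id = c("A", "B"))
jobs$skill <- list(c("microsoft excel", "product development"),
                   c("sql", "python"))
jobs$skillTypeld <- list(c(2, 2), 1)

# lengths() returns the number of elements in each list entry
n_skill <- lengths(jobs$skill)
n_type  <- lengths(jobs$skillTypeld)

# Job ids whose two vectors do not line up
mismatched <- jobs$job.Id[n_skill != n_type]
print(mismatched)
```

Any job id reported here has a different number of skills and skill types, so it is a candidate for the missing or extra rows.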

Mark
Koh
  • Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – Sotos Jul 22 '20 at 09:31
  • What do you mean by a different number of entries? Is it that the value of `skill` for a certain `jobID` is of varying length? – Romain Jul 22 '20 at 09:40
  • See https://stackoverflow.com/questions/15347282/split-delimited-strings-in-a-column-and-insert-as-new-rows and https://stackoverflow.com/questions/26194298/unlist-data-frame-column-preserving-information-from-other-column . – Ronak Shah Jul 22 '20 at 09:55
  • @Romain yup! After splitting them into 2 data frames, they should still have the same number of entries after separate_rows(). The first data frame has jobId & skill, the second has jobId & skillTypeId. In my example above, "microsoft excel" is of skillTypeId 2, and "product development" is of skillTypeId 2 as well. Each skill belongs to a skillTypeId. In my data, after unlisting each row, the length of skill should thus equal the length of skillTypeId. But somehow it didn't, so I suspect separate_rows() removed some entries that should have been kept. – Koh Jul 24 '20 at 01:44

1 Answer

library(tidyr)
unnest_longer(df, c(skill, skillTypeld))

Read the documentation for more information on usage: https://tidyr.tidyverse.org/reference/unnest_longer.html
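For illustration, here is a hedged sketch of that call on the question's one-row example, rebuilt with `tibble::tibble()` instead of the `dput` output. It assumes a tidyr version recent enough for `unnest_longer()` to accept several columns at once; on older versions, `unnest(df, cols = c(skill, skillTypeld))` may be the closer equivalent. The name `long` is only for this example.

```r
library(tidyr)

# The question's example row: one job with two skills and two skill types
df <- tibble::tibble(
  job.Id = "A",
  skill = list(c("microsoft excel", "product development")),
  skillTypeld = list(c(2, 2))
)

# Unnest both list columns together so that element i of `skill`
# stays paired with element i of `skillTypeld`
long <- unnest_longer(df, c(skill, skillTypeld))
print(long)
```

Because both columns are unnested in a single call, there is no need to split the data into two data frames and `cbind` them afterwards, which is the step where the row counts could drift apart.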

Mark