0

Hello everyone and many thanks in advance for your help,

Kit of the question

As can be seen above, I have to assign numerical values from column A to column B, so that the number 4193 in column B matches the number 1 in column A each and every time, the number 15 in column B matches the number 2 in column A each and every time, and so on and so forth (this is just a random sample that I've presented as an example, as I'm working with an extremely large dataset). This should have been no big deal, but the thing is that these two columns are of different size (column A is much larger than column B).

I've spent hours and hours trying to do this by myself, as well as browsing forums, but I haven't found any similar question on how to get around this problem. Also, because the dataset I'm working with is extremely large, there is no way I could do this manually.

The main idea would be to have each single number in column B repeated standing side by side with its corresponding number from the column A (as explained previously). I don't know how to do this computationally, but logically, the idea is either to enlarge column B, or to put the numbers into another column, say, column C.

I'd be enormously grateful if anyone could help me out with this. A gist on how to do this in R would be much appreciated.

Many thanks once again!

ADDENDA:

DATA


df <- structure(list(A = c(rep(list(1048575))
), B = c(rep(list(10571)))), class = "data.frame", row.names = c(-10571L, -1048575L))

After introducing this data and trying to run the code, I get the following error: "Internal error in df_slice(): Columns must match the data frame size.".

CODE

The closest I have got so far to the desired 1048575 observations in Column B is by using the following code:


Result <- match(A, unique(B))

However, the given output is a matrix of 'NA's, while what I'm looking for is to have two numerical columns of 1048575 observations each.

Similarly, this other code does give a numerical output, but it is still in a matrix form and the order is incorrect:


Result <- match((A %in% unique(B)), B, table(A)[seq_along(B)])

The desired output should look like this:

Output

Thanks once again!

ArtUr693
  • 43
  • 5
  • 1
    Are the values in column A groups? e.g group 1 , 2, 3 etc.. And most important is the data set ordered? – TarJae Apr 15 '22 at 18:55
  • 4
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Please do not post image of data because we do not what to have to retype everything to test the code. – MrFlick Apr 15 '22 at 19:29
  • Thank you for your suggestion! I have just edited my question, including both data and code, as well as an example of the desired output (albeit in the form of a picture in this case). – ArtUr693 Apr 17 '22 at 16:05

1 Answers1

0

Here is one possible option if the data set is ordered using tidyverse. First, we keep only the rows with values in B, then we replace the values in A with just the unique values from A in the original df. Then, we can join that back to the original dataframe to fill all the rows with the corresponding values.

library(dplyr)

df %>%
  filter(complete.cases(B)) %>% 
  mutate(A = unique(df$A)) %>% 
  left_join(df %>% select(A), ., by = "A")

A base R option that is likely more efficient would be to use a combination of table and rep. Here, we take just the non-NA values from B, then repeat those values according to the frequency of each unique value in A, then we can just re-assign that to column B. Here, I also use factor with table to preserve the order of A.

df$B <- rep(df$B[!is.na(df$B)], table(factor(df$A, levels=unique(df$A))))

Output

   A    B
1  1 4193
2  1 4193
3  1 4193
4  1 4193
5  1 4193
6  1 4193
7  1 4193
8  2   15
9  2   15
10 2   15
11 3 9734
12 3 9734
13 3 9734
14 3 9734

Data

df <- structure(list(A = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3
), B = c(4193, 15, 9734, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA)), class = "data.frame", row.names = c(NA, -14L))
AndrewGB
  • 16,126
  • 5
  • 18
  • 49
  • Dear Andrew, many thanks for helping me out with this! The only problem is that your solution works just fine for 14 values, as in the example that you proposed. However, the thing is that I have 1.048.575 ordered values in Column A that I need to match with 10571 values in Column B. The "NA" part regarding the Column B doesn't seem to give a problem; however, I cannot list manually all 1048575 that belong to column A for the code to run. Is there any chance you would know how it is possible to get around this issue? – ArtUr693 Apr 16 '22 at 10:13
  • @ArtUr693 I'm not quite sure what you mean by manual, as nothing has to be manually entered here. But perhaps, you mean that you are running out of memory to be able to run it? Or if you have some other data structure, where you aren't starting with a dataframe, then you'll have to add a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). However, I just added a base R option that should be more efficient and use less memory. – AndrewGB Apr 16 '22 at 20:59
  • Andrew, many thanks once again for your kind attention! Maybe the problem that I'm having stems from the way I'm trying to introduce the data -because, otherwise, your code runs just fine in both cases for the example that you've used-. I've just "appended" the data, so that you can see, and -hopefully-, if you are so kind, correct me on the way I'm introducing the data. Being the dataset that I'm working with so large, most probably I just don't know how to structure the data so that all 1048575 values from Column A and all 10571 values from Column B are included. Many thanks once again! – ArtUr693 Apr 17 '22 at 11:18
  • Is your problem really how to include two columns with different lengths?If it is, you just need to read/import your original data and then store them in a `list` which is not a `data.frame`. `mydata<- list( A= imported_columnA, B= imported_column B)` – Abdur Rohman Apr 17 '22 at 13:58
  • Thank you for your suggestion, Abdur! I have tried to store the data into a list, and then run the following code: ' Result <- rep(df$B, table(df$A)[seq_along(df$B)]) '. However, I keep having the same problem (0s are missing from the output), and the output is in the form of a matrix, whereas I need it to be in the form of 2 columns. Many thanks in any case! – ArtUr693 Apr 17 '22 at 15:02