R - Extract values depending on multiple conditions

Question

I have 2 dataframes, here are the codes:

set.seed(100)

x1= rnorm(4)
x2= rnorm(4)
x3= rnorm(4)
x4= rnorm(4)
x5= rnorm(4)
x6= rnorm(4)
x7= rnorm(4)
x8= rnorm(4)
x9= rnorm(4)
x10= rnorm(4)
df1 = data.frame(Station1 = x1, Station2 = x2, Station3 = x3, Station4 = x4, Station5 = x5, Station6 = x6, Station7 = x7, Station8 = x8, Station9 = x9, Station10 = x10) 

x1= c("Station1", "Station2", "Station3", "Station4", "Station5", "Station6", "Station7", "Station8", "Station9", "Station10")
x2= seq(-2,10 , length=10)
x3= seq(30, 45, length=10)
x4= c(1, 3, 2, 1, 4, 2, 4, 3, 3, 1)
x5= seq(4, 16, length=10)
df2 = data.frame(Station=x1, Lon=x2, Lat=x3, Number=x4, Mis=x5)

Now I want to extract certain values of df1 and add them in a new column in df2. The df2$Number column goes from 1-4, just like the number of rows in df1. When there is a 1 in the df2$Number of Station1 I want to extract the value of the 1st row of df1 of the matching Station1. Another example would be: df2$Number for Station2 is 3, so I want to extract the value in the 3rd row of Station2 in df1, that would be -0.5817907.

All of these extracted values should be added in a new column in df2.

Here my examples:

>df1
    Station1   Station2    Station3    Station4   Station5   Station6   Station7    Station8   Station9  Station10
1 -0.50219235  0.1169713 -0.82525943 -0.20163395 -0.3888542 -0.4380900 -0.8143791 -1.15772946 -0.1379296
2  0.13153117  0.3186301 -0.35986213  0.73984050  0.5108563  0.7640606 -0.4384506  0.24707599 -0.1111935
3 -0.07891709 -0.5817907  0.08988614  0.12337950 -0.9138142  0.2619613 -0.7202216 -0.09111356 -0.6900143
4  0.88678481  0.7145327  0.09627446 -0.02931671  2.3102968  0.7734046  0.2309445  1.75737562 -0.2217942 0.1829077 0.4173233 1.0654023 0.9702020

> df2
    Station        Lon      Lat Number       Mis
1   Station1 -2.0000000 30.00000      1  4.000000
2   Station2 -0.6666667 31.66667      3  5.333333
3   Station3  0.6666667 33.33333      2  6.666667
4   Station4  2.0000000 35.00000      1  8.000000
5   Station5  3.3333333 36.66667      4  9.333333
6   Station6  4.6666667 38.33333      2 10.666667
7   Station7  6.0000000 40.00000      4 12.000000
8   Station8  7.3333333 41.66667      3 13.333333
9   Station9  8.6666667 43.33333      3 14.666667
10 Station10 10.0000000 45.00000      1 16.000000

Here are Links to my actual dataframes:

df1: https://megastore.uni-augsburg.de/get/IftpNjXNqL/

df2: https://megastore.uni-augsburg.de/get/8_Y6SPVCA9/

And just a question: Why do I get downvotes for my question right away? What is wrong with it? I am really trying to make it as clear as possible, provide all information and spend a lot of time in writing it. — Essi, Jan 27 '18 at 22:25
I think you should add a column `Number` to df1 and then `melt` df1 to 1 obs per `Number` per `Station`. Then do `df2` left join `df1` on `Station` and `Number`. — C8H10N4O2, Jan 27 '18 at 22:37
You asked what to do differently in the future. I thought your question was fine. But consider defining the dataset statically instead of randomly (with something like `dput()`, `data.frame()`, or `tibble::tribble()`). Also, consider specifying the desired output or vector (like `value` in my answer, and `New` in @erocoar's). I assume you've seen this, or something like it: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. — wibeasley, Jan 27 '18 at 23:13

erocoar · Accepted Answer · 2018-01-28T00:33:24.447

2

df2$Station <- as.numeric(gsub("[^0-9]", "", df2$Station))
df2$New <- apply(df2, 1, function(x) df1[x[4], x[1]])

df2

   Station        Lon      Lat Number       Mis         New
1        1 -2.0000000 30.00000      1  4.000000 -0.50219235
2        2 -0.6666667 31.66667      3  5.333333 -0.58179068
3        3  0.6666667 33.33333      2  6.666667 -0.35986213
4        4  2.0000000 35.00000      1  8.000000 -0.20163395
5        5  3.3333333 36.66667      4  9.333333  2.31029682
6        6  4.6666667 38.33333      2 10.666667  0.76406062
7        7  6.0000000 40.00000      4 12.000000  0.23094453
8        8  7.3333333 41.66667      3 13.333333 -0.09111356
9        9  8.6666667 43.33333      3 14.666667 -0.69001432
10      10 10.0000000 45.00000      1 16.000000  0.18290768

This would be one option - just change the Station column to numeric for indexing and get the values by looping through it with apply

Edit: For your real data, the column indexes are different and regex may or may not work - so I would do it this way:

MainFrame$New <- apply(MainFrame, 1, function(x) Centroids[x[3], x[1]])

edited Jan 28 '18 at 00:33

answered Jan 27 '18 at 22:30

erocoar

5,723
3
23
45

Thanks! Thats already a big step for me. I only need to save the real Stationnames. But that I can manage by myself! Thanks a lot! – Essi Jan 27 '18 at 22:44
Yeah, you could also just store the index created in a variable other than Station. Glad it helped :) – erocoar Jan 27 '18 at 22:46
Hi, I have big problems adapting the code to my real dataframe. I dont know whats the problem, but I get error massages all the time. It works so fine for my testframe. Could you please have a look on my real dataframes? I uploaded them and the link is above! – Essi Jan 28 '18 at 00:14
Thank you very much! That worked perfectly! The strange thing is that I tried that yesterday, 1:1 in the same way, but it didnt work. I changed the Stations to numerics, and it doesnt work then. – Essi Jan 28 '18 at 15:20

wibeasley · Answer 2 · 2018-01-27T22:47:37.237

2

A conventional database approach would convert df1 to a long/tall dataset. Then the selection is performed with a left join (with a package like dplyr, data.table, or even base::merge()).

library(magrittr)
df_value_long <- df1 %>% 
  dplyr::mutate(
    Number   = seq_len(n())
  ) %>% 
  tidyr::gather(Station, value, -Number)

df2b <- df2 %>% 
  dplyr::left_join(df_value_long, by=c("Station", "Number"))

df_value_long:

   Number   Station       value
1       1  Station1 -0.50219235
2       2  Station1  0.13153117
3       3  Station1 -0.07891709
4       4  Station1  0.88678481
5       1  Station2  0.11697127
6       2  Station2  0.31863009
7       3  Station2 -0.58179068
8       4  Station2  0.71453271
...

df2b:

     Station        Lon      Lat Number       Mis       value
1   Station1 -2.0000000 30.00000      1  4.000000 -0.50219235
2   Station2 -0.6666667 31.66667      3  5.333333 -0.58179068
3   Station3  0.6666667 33.33333      2  6.666667 -0.35986213
...

edited Jan 27 '18 at 22:47

answered Jan 27 '18 at 22:40

wibeasley

5,000
3
34
62

Thanks! I am not sure about converting. My real dataframe is huge it might take a lot of time. But on the other hand I dont have to overwrite my stations with this solution. I will look what is better for me. – Essi Jan 27 '18 at 22:49
Large datasets typically come from databases, and databases are usually designed to be tall. But if your original source is already wide, I see the advantage of @erocoar's approach. I added this to his approach mostly for the sake of completeness. – wibeasley Jan 27 '18 at 22:58
Thanks a lot! I really love to have different solutions. – Essi Jan 27 '18 at 23:22
Hi, I have big problems adapting the code to my real dataframe. I dont know whats the problem, but I get error massages all the time. It works so fine for my testframe. Could you please have a look on my real dataframes? I uploaded them and the link is above! – Essi Jan 28 '18 at 00:14
How are you reading those files into a data.frame? Can you include code? I'm trying variations of `df1 <- readr::read_delim("https://megastore.uni-augsburg.de/get/IftpNjXNqL/", delim=" ")`. It looks there are different problems with both. In `df1`, the first column (of integers) doesn't have a row header. In `df2`, I'm getting the error "invalid multibyte string 6" (using `readr::read_delim()` and `utils::read.delim()`) with the column like "H�he". – wibeasley Jan 28 '18 at 17:29
Oh, just download it! Its a .dat :/ I already have now a working solution. But thanks a lot for the effort. – Essi Jan 28 '18 at 23:09

Onyambu · Answer 3 · 2018-01-28T17:43:26.457

1

df1$NEW=df1[cbind(df2$Number,1:10)]
df1
     Station        Lon      Lat Number       Mis         NEW
1   Station1 -2.0000000 30.00000      1  4.000000 -0.50219235
2   Station2 -0.6666667 31.66667      3  5.333333 -0.58179068
3   Station3  0.6666667 33.33333      2  6.666667 -0.35986213
4   Station4  2.0000000 35.00000      1  8.000000 -0.20163395
5   Station5  3.3333333 36.66667      4  9.333333  2.31029682
6   Station6  4.6666667 38.33333      2 10.666667  0.76406062
7   Station7  6.0000000 40.00000      4 12.000000  0.23094453
8   Station8  7.3333333 41.66667      3 13.333333 -0.09111356
9   Station9  8.6666667 43.33333      3 14.666667 -0.69001432
10 Station10 10.0000000 45.00000      1 16.000000  0.18290768

edited Jan 28 '18 at 17:43

answered Jan 28 '18 at 05:52

Onyambu

67,392
3
24
53

That's pretty risky to assume for the entire dataset that the stations are always ordered the same way – erocoar Jan 28 '18 at 10:24
That’s an easy fix. You use gsub.. will fix that right now – Onyambu Jan 28 '18 at 14:37
Sure is :) Just saying – erocoar Jan 28 '18 at 14:40
After a thought I realized am drawing from every column sequentially..not bypassing any – Onyambu Jan 28 '18 at 17:41

R - Extract values depending on multiple conditions

3 Answers3