How to find similarities between 2 datasets and generate a new dataframe consisting of these rows which coincide?

Question

I have the results of radiosonde observations for more than 1000 stations in one file and a list of the stations (81) that actually interest me. I need to make a new data frame containing only the rows of the first file that belong to those stations.

So, I have two datasets imported from .txt files into R. The first is a 6694668x6 data frame and the second is 81x1, where the second dataset's rows coincide with some of the values in the first dataset's 1st column (the values look like this: ACM00078861).

d = data.frame(matrix(ncol = 6, nrow = 0)) 
for(i in 1:81){ 
  for (j in 1:6694668) {
    if(stations[i,1] == ghgt_00z.mly[j,1]){ 
      rbind(d,ghgt_00z.mly[j,] ) 
      j + 1 
    } else {j+1}
  }
}

I wanted to generate a new data frame that looks like "ghgt_00z.mly" but contains only the rows for the stations listed in "stations". Of course, the code ran for a couple of days and all I received was a warning message. Please help me!

Could you provide reproducible examples of the two databases? — JDG, Oct 17 '19 at 08:38
@J.G. these are on my google drive if it is okay https://drive.google.com/file/d/1z0N3Q4l1h-2QBzPuiDuhXrp86VPG05qY/view?usp=drivesdk https://drive.google.com/file/d/1UDSfcitLuIKKfpzIvXtHmkyAuCdZZUaD/view?usp=drivesdk — Alina Lerner, Oct 17 '19 at 09:03

score 1 · Accepted Answer · answered Oct 17 '19 at 08:46

There are a lot of ways to do this. I personally use the classic merge():

res <- merge(x = stations, y = ghgt_00z.mly, by = 'common_column_name', all.x = TRUE)

Where common_column_name is the name of the column present in both data frames (if the columns are named differently, use by.x and by.y instead). The result combines the two data frames and contains all columns from both datasets; you can remove the ones you don't need.
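To make the merge() call concrete, here is a minimal sketch on toy data. The column names (id, year, value), the second and third station IDs, and all numbers are made up for illustration; only the ID format comes from the question, and the real files will need their own column names.

```r
# Toy stand-ins for the real data; names and values are assumptions.
stations <- data.frame(
  id = c("ACM00078861", "AEM00041217"),
  stringsAsFactors = FALSE
)
ghgt_00z.mly <- data.frame(
  id    = c("ACM00078861", "ZZM00000001", "ACM00078861", "AEM00041217"),
  year  = c(1990, 1990, 1991, 1990),
  value = c(5821, 5790, 5830, 5875),
  stringsAsFactors = FALSE
)

# Left join from the short station list: every row of the big table
# whose id is listed in stations ends up in the result.
res <- merge(x = stations, y = ghgt_00z.mly, by = "id", all.x = TRUE)
print(res)
```

With all.x = TRUE, a station that has no observations at all would still appear once, with NA in the other columns; dropping all.x keeps only stations that actually match.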

Second useful option is:

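library(dplyr)
# Station IDs to keep, taken from the short list
inp <- stations$column_in_stations
# Keep the rows of the big table whose ID matches any of them
res <- filter(ghgt_00z.mly, grepl(paste(inp, collapse="|"), column_of_interest))

Where column_in_stations (in stations) and column_of_interest (in ghgt_00z.mly) should share some values.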
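Since the station IDs have to match exactly, a regular expression may not be needed at all. A minimal sketch with base R's %in%, again on toy data where the column names (id, value), the extra IDs and the numbers are assumptions rather than the real file layout:

```r
# Toy stand-ins for the real data; names and values are assumptions.
stations <- data.frame(
  id = c("ACM00078861", "AEM00041217"),
  stringsAsFactors = FALSE
)
ghgt_00z.mly <- data.frame(
  id    = c("ACM00078861", "ZZM00000001", "ACM00078861", "AEM00041217"),
  value = c(5821, 5790, 5830, 5875),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where the row's id is one of the wanted stations.
keep <- ghgt_00z.mly$id %in% stations$id

# Subset the big table in one vectorised step, with no explicit loop.
res <- ghgt_00z.mly[keep, ]
print(res)

# With dplyr loaded, the same subset could be written as:
# res <- filter(ghgt_00z.mly, id %in% stations$id)
```

This keeps the columns of ghgt_00z.mly unchanged, which is closer to "a data frame that looks like the original" than the merged result.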

Since I don't have the datasets, I can't check these solutions, so I can't guarantee that they work.

Adamm

Thank you! The files are on my Google Drive; I would be really grateful if you could check them: https://drive.google.com/file/d/1UDSfcitLuIKKfpzIvXtHmkyAuCdZZUaD/view?usp=drivesdk – Alina Lerner Oct 17 '19 at 09:01
https://drive.google.com/file/d/1z0N3Q4l1h-2QBzPuiDuhXrp86VPG05qY/view?usp=drivesdk – Alina Lerner Oct 17 '19 at 09:01
Check out [this](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) post. Put several rows of each df in your question. Create the data in R just like the person who posted his question [here](https://stackoverflow.com/questions/45137867/combining-columns-while-ignoring-duplicates-and-nas/45138444#45138444) – Adamm Oct 17 '19 at 09:07
