How to access column values of a TSV based on computationally read out keywords in Python/Pandas/R?

Question

I am going to describe my question with this minimal, reproducible example (reprex). I have two tab separated value files df1.map and df2.map. df1.map looks like this:

1       BICF2G630707759 0       3014448
1       BICF2S2358127   0       3068620
1       BICF2P1173580   0       3079928
1       BICF2G630707846 0       3082514
1       BICF2G630707893 0       3176980
1       TIGRP2P175_RS8671583    0       3198886
1       BICF2P1383091   0       3212349
1       TIGRP2P259_RS8993730    0       3249189
1       BICF2P186608    0       3265742
1       BICF2G630707908 0       3273096

df2.map is this:

1       BICF2P413765    0       85620576
1       BICF2P413768    0       85621395
1       BICF2P860004    0       85633349
1       BICF2G630707846   0       85660017
1       BICF2G630707893     0       85684560
1       CHR1_85660017   0       85660017
1       BICF2P91636     0       85684560
1       CHR1_85685260   0       85685260
1       BICF2P172347    0       85700399
1       BICF2P1125163   0       85707031

The second column in each file holds an ID information. The question is: Check for every ID in df1.map if it is present in df2.map. If it is present, take the information of that ID, of that row from the columns 1 and 4 from df2.map and over-write the columns 1 and 4 in df1.map with the information from df2.map.

I tried one solution with python (after converting from TSV to CSV):

import pandas as pd

CF2 = pd.read_csv('df1.csv', low_memory=False)
CF3 = pd.read_csv('df2.csv', low_memory=False)

i = 0
while i < CF2.shape[0]:
    if (CF2[CF2.columns[1]].values[i]) == (CF3[CF3.columns[1]]):
        print (CF3[CF3.columns[0]])
i = i + 1

Yet, that resulted only in errors.

Then I tried to solve it in R:

CFTwo <- read.table("df1.map", sep="\t", header=FALSE)
CFThree <- read.table("df2.map", sep="\t", header=FALSE)

for (i in CFThree [,2]) {
    if (i == CFTwo [,2]) {
        print(CFTwo [,c(1,2,4)])}}

That gave 50 warnings and no result. Since I was already not able to get the right values printed, I am far away from over-writing and changing files. Besides Python and R I am also open to a solution in bash.

You could use a left join with [tag:r]'s [tag:data.table] package, see [here](https://stackoverflow.com/q/34598139/2204410) and [here](https://stackoverflow.com/a/32884254/2204410) and [here](https://stackoverflow.com/q/43545307/2204410) for examples. — Jaap, Jun 25 '20 at 10:35

Len Greski · Accepted Answer · 2020-06-25T10:51:05.507

We can solve this problem by merging the second file with the first one, and checking for NA values to recode V1 and V4 with if_else(). A dplyr based solution looks like this:

file1 <- "1       BICF2G630707759 0       3014448
1       BICF2S2358127   0       3068620
1       BICF2P1173580   0       3079928
1       BICF2G630707846 0       3082514
1       BICF2G630707893 0       3176980
1       TIGRP2P175_RS8671583    0       3198886
1       BICF2P1383091   0       3212349
1       TIGRP2P259_RS8993730    0       3249189
1       BICF2P186608    0       3265742
1       BICF2G630707908 0       3273096"

file2 <- "1       BICF2P413765    0       85620576
1       BICF2P413768    0       85621395
1       BICF2P860004    0       85633349
1       BICF2G630707846   0       85660017
1       BICF2G630707893     0       85684560
1       CHR1_85660017   0       85660017
1       BICF2P91636     0       85684560
1       CHR1_85685260   0       85685260
1       BICF2P172347    0       85700399
1       BICF2P1125163   0       85707031"

We read the raw data, and for the first file use the default column names. For the second file, we change the names of every column except the key that joins with the first file. This avoids a duplicate column names problem when we join the files.

df1.map <- read.table(text=file1,header=FALSE)
df2.map <- read.table(text = file2,header = FALSE,
                      col.names = c("df2.V1","V2","df2.V3","df2.V4"))

Next, we join the data and use mutate() to set the values of V1 and V4 if the corresponding values of df2.V1 and df2.V4 are not NA. We use left_join() because the output data should include all rows from df1.map, regardless of whether df2.map contributed data to a particular row.

Once recoded, we drop the columns that begin with df2.* from the result data frame.

library(dplyr)
df1.map %>% left_join(.,df2.map) %>%
     mutate(V1 = if_else(is.na(df2.V1),V1,df2.V1),
            V4 = if_else(is.na(df2.V4),V4,df2.V4)) %>%
     select(.,-c(df2.V1,df2.V3,df2.V4))

...and the output:

Joining, by = "V2"
   V1                   V2 V3       V4
1   1      BICF2G630707759  0  3014448
2   1        BICF2S2358127  0  3068620
3   1        BICF2P1173580  0  3079928
4   1      BICF2G630707846  0 85660017
5   1      BICF2G630707893  0 85684560
6   1 TIGRP2P175_RS8671583  0  3198886
7   1        BICF2P1383091  0  3212349
8   1 TIGRP2P259_RS8993730  0  3249189
9   1         BICF2P186608  0  3265742
10  1      BICF2G630707908  0  3273096
>

Thank you for your answer, which worked here for the reprex and also for my actual bigger dataset - thanks! — PolII, Jun 25 '20 at 12:32

How to access column values of a TSV based on computationally read out keywords in Python/Pandas/R?

1 Answers1