1

First of all, I have a matrix of features and a data.frame of features from two separate text sources. On each of those, I have performed different text mining methods. Now, I want to combine them but I know some of them have columns with identical names like the following:

> dtm.matrix[1:10,66:70]
       cough nasal sputum yellow intermitt
    1      1     0      0      0         0
    2      1     0      0      0         0
    3      0     0      0      0         0
    4      0     0      0      0         0
    5      0     0      0      0         0
    6      1     0      0      0         0
    7      0     0      0      0         0
    8      0     0      0      0         0
    9      0     0      0      0         0
    10     0     0      0      0         0

> dim(dtm.matrix) [1] 14300 6543

And the second set looks like this:

    > data1.sub[1:10,c(1,37:40)]
   Data number cough coughing up blood dehydration dental abscess
1            1     0                 0           0              0
2            3     1                 0           0              0
3            6     0                 0           0              0
4            8     0                 0           0              0
5            9     0                 0           0              0
6           11     1                 0           0              0
7           12     0                 0           0              0
8           13     0                 0           0              0
9           15     0                 0           0              0
10          16     1                 0           0              0
> dim(data1.sub)
[1] 14300   168

I got this code from this topic but I'm new to R and I still need some help with it:

    `data1.sub.merged <- dcast.data.table(merge(
    ## melt the first data.frame and set the key as ID and variable
    setkey(melt(as.data.table(data1.sub), id.vars = "Data number"), "Data number", variable), 
  ## melt the second data.frame
  melt(as.data.table(dtm.matrix), id.vars = "Data number"), 
  ## you'll have 2 value columns...
  all = TRUE)[, value := ifelse(
  ## ... combine them into 1 with ifelse
  (value.x == 0), value.y, value.x)], 
  ## This is the reshaping formula
  "Data number" ~ variable, value.var = "value")`

When I run this code, it returns a matrix of 1x6667 and doesn't merge the "cough" (or any other column) from the two data sets together. I'm confused. Could you help me how this works?

Diana01
  • 183
  • 1
  • 1
  • 10
  • 1
    it seems like you are simply trying to merge these two objects on a column shared in each called "Data number" if this is the case, given that you want to preserve the rest of your data, a simple merge would work, the essential thing is to have all of the columns which are duplicated in both sets (to be merged on) be explicitly referenced as keys. And in order for that to work, all data in all keys from both sets must agree with the same columns and rows in the other set. You should probably convert them both to the same type of data structure first also, like a data frame. – sconfluentus Jul 29 '17 at 22:25
  • I am going to merge these objects together for all the columns that don't have an identical name. For columns that do have an identical name, I want to merge both columns into one column while keeping all the =1 values from both objects in them. If this is clear, could you show me a sample of code that works? – Diana01 Jul 29 '17 at 23:17
  • 2
    Welcome to StackOverflow! As a matter of site policy (and even moreso within the `r` tag) we ask that you provide a Minimal, Complete, Verifiable Example (MCVE) rather than just showing us what the data looks like. Please provide reproducible data, such as by `dput` or using a builtin data set in your question. Please see http://stackoverflow.com/help/mcve and https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Hack-R Jul 29 '17 at 23:23
  • Is there a column in each, that is a unique ID number for each column which is not "data" upon which you will merge them, or are you looking at just aligning the columns of the two sets up side by side and substituting across for the duplicate columns with shared values? – sconfluentus Jul 30 '17 at 02:02

1 Answers1

1

There are many ways to do that, f.e. using base R, data.table or dplyr. The choice depends on the volume of your data, and if you, say, work with very large matrices (which is usually the case with natural language processing and bag of words representation), you may need to play with different ways to solve your problem and profile the better (=the quickest) solution. I did what you wanted via dplyr. This is a bit ugly but it works. I just merge two dataframes, then use for cycle for those variables which exist in both dataframes: sum them up (variable.x and variable.y) and then delete em. Note that I changed a bit your column names for reproducibility, but it shouldn't have any impact. Please let me know if that works for you.

df1 <- read.table(text = 
'     cough nasal sputum yellow intermitt
1      1     0      0      0         0
2      1     0      0      0         0
3      0     0      0      0         0
4      0     0      0      0         0
5      0     0      0      0         0
6      1     0      0      0         0
7      0     0      0      0         0
8      0     0      0      0         0
9      0     0      0      0         0
10     0     0      0      0         0')

df2 <- read.table(text = 
'   Data_number cough coughing_up_blood dehydration dental_abscess
1            1     0                 0           0              0
2            3     1                 0           0              0
3            6     0                 0           0              0
4            8     0                 0           0              0
5            9     0                 0           0              0
6           11     1                 0           0              0
7           12     0                 0           0              0
8           13     0                 0           0              0
9           15     0                 0           0              0
10          16     1                 0           0              0')

# Check what variables are common
common <- intersect(names(df1),names(df2))

# Set key IDs for data
df1$ID <- seq(1,nrow(df1))
df2$ID <- seq(1,nrow(df2))

# Merge dataframes
df <- merge(df1, df2,by = "ID")

# Sum and clean common variables left in merged dataframe
library(dplyr)

for (variable in common){
  # Create a summed variable
  df[[variable]] <- df %>% select(starts_with(paste0(variable,"."))) %>% rowSums()
  # Delete columns with .x and .y suffixes
  df <- df %>% select(-one_of(c(paste0(variable,".x"), paste0(variable,".y"))))
}

df
   ID nasal sputum yellow intermitt Data_number coughing_up_blood dehydration dental_abscess cough
1   1     0      0      0         0           1                 0           0              0     1
2   2     0      0      0         0           3                 0           0              0     2
3   3     0      0      0         0           6                 0           0              0     0
4   4     0      0      0         0           8                 0           0              0     0
5   5     0      0      0         0           9                 0           0              0     0
6   6     0      0      0         0          11                 0           0              0     2
7   7     0      0      0         0          12                 0           0              0     0
8   8     0      0      0         0          13                 0           0              0     0
9   9     0      0      0         0          15                 0           0              0     0
10 10     0      0      0         0          16                 0           0              0     1
Alex Knorre
  • 620
  • 4
  • 15