I am using RQDA which is a package in rstudio to code text manually. The final rqda-file is a sql database.
I have coded statements in texts and used different codes and subsumed
them in code categories (for instance: code category "actor_party" and
then the related codes "socialist", "liberal", "conservative" etc.). I
finished the coding and want to perform a social network analysis with
it. For this purpose, I want to create a sql database, so that each code
category gets its own column with the specific codes in each row.
Every code can be identified by the following attributes: catid (=code category number), fid (file identification number) & selfirst (start of each code). By doing so, the specific catid, the fid & selfirst are selected for each
coded statement so that sqlite can identify each coding as unique (additionally and as you can see below in the R script, status=1 for each valid coding has to be elected too).
I use rstudio in the version 0.99.879, rqda in the version 0.2-7 and rsqlite 1.0.0.
So, the following R code is used:
library(RSQLite) # load Package RSQLite
setwd("C:/...")
system("ls *.rqda", show=TRUE)
sqlite <- dbDriver("SQLite")
#specifing the file
qdadb <- dbConnect(sqlite,"My_data.rqda")
dbListTables(qdadb)
dbListFields(qdadb, "coding") # that's where the codings are stored
catid <- dbGetQuery(qdadb, "select distinct(catid) from treecode where status = 1 ORDER BY catid")
i <- 1
table <- dbGetQuery(qdadb, "select fid, selfirst from coding where status = 1 GROUP BY fid, selfirst")
while(i <= max(catid)) {
ids <- dbGetQuery(qdadb, paste("select cid from treecode where (catid = ",i," and status = 1)", sep=""));
t <- dbGetQuery(qdadb, paste("select cid, fid, selfirst from coding where (cid in (", paste(as.character(ids$cid), sep="' '", collapse=","), ") and status = 1)", sep=""));
table <- merge(table, t, by = c("fid","selfirst"), all.x = T);
i <- i + 1;
}
# warnings are created because of the same columns which are duplicated by the merging
colnames(table) <- c("fid", "selfirst", dbGetQuery(qdadb, "select name from codecat where status = 1")[,1]) #each code has attributed a unique f(ile)id and selfirst (it's the unique starting point of each coding)
# see below for an example of such a created table
library(car) # Companion to Applied Regression package
# years - catid = 1
table$A00_time_frame <- recode(table$A00_time_frame, '1 = 2010; 2 = 2011; 3 = 2012; 4 = 2013; 5 = 2014; 6 = 2015')
# Sources - catid = 2
ids <- dbGetQuery(qdadb, "select cid from treecode where (catid = 2 and status = 1)")[,1]
values <- dbGetQuery(qdadb, paste("select name from freecode where (id in(", paste(ids, collapse = ","), ") and status = 1)"))[,1]
table$B00_source <- recode(table$B00_source, paste0("'", paste(ids,"'='", values, collapse = "';'", sep=""),"'", sep=""))
# Claimant type - catid = 3
ids <- dbGetQuery(qdadb, "select cid from treecode where (catid = 3 and status = 1)")[,1]
values <- dbGetQuery(qdadb, paste("select name from freecode where (id in(", paste(ids, collapse = ","), ") and status = 1)"))[,1]
table$C00_claimant_type <- recode(table$C00_claimant_type, paste0("'",
paste(ids,"'='", values, collapse = "';'", sep=""),"'", sep=""))
and so until "catid = 20"
This works out and looks like this: example_table [and this table goes on until row 844 - only the fid is ascending]
Even though, this works out and the created table matches with the total number of codings, some mistakes are happening. Some codes aren't linked to the correct statement (even though they are linked to the correct code category, but not to the correct and coded statement)
I am still a beginner in R(studio) and can't explain what went wrong.
Has anybody an idea what could be the problem or mistake here and how it could be fixed?
Upon request, I am happy to share my file :)
Any advice or help is very welcome!!
Edit:
Here is a link to a subset of my data that you can reproduce it (the file is in the rqda-format, because I think, the conversion of it might be the problem itself).
Moreover and to give you two example where to look at.
By creating the 'table' in R, the following row can be identified
1. - fid 95, selfirst 4553 and then the codings 'Welt' and then 'E02_European_Commission' + 'G10_Cameroon' later
However, if you check the coding in the original rqda-file, the code 'Cameroon' is not in this file, rather in fid 70, selfirst 5082 and in 'Welt' in year '2010'
- fid 90, selfirst 959 and year '2011' shows the codes 'CDU' and the last row 'special claimant' displays the name 'Martin Schulz'.
However, if you check the coding in the original rqda-file, there is no coding attached to the code 'Martin Schulz' in the subset.
- fid 90, selfirst 959 and year '2011' shows the codes 'CDU' and the last row 'special claimant' displays the name 'Martin Schulz'.
I hope, these two examples illustrate the problem and give you an idea where to look at respectively what the problem is.
Sorry that I haven't provided it in the first place!