Over the past few days, I have been building a keyword list from documents:
txtdir <- "E:/2015Ddtrial"
txtnames <- list.files(txtdir, pattern = "\\.txt$", all.files = FALSE,
                       recursive = TRUE, include.dirs = FALSE, full.names = TRUE)
# prepare the vector for data input
pkw <- vector()
# extract the keywords of the papers in the text files
for (i in seq_along(txtnames)) {
  fc <- file(txtnames[i], open = "r", encoding = "UTF-8")
  # alternative way of reading the files:
  # txtls <- readLines(con = fc, n = 6, skipNul = TRUE)
  txtls <- scan(file = fc, what = character(), nmax = 6, sep = "\n",
                blank.lines.skip = TRUE, skipNul = TRUE)
  # keep only the line that contains the keyword marker
  rawvec <- grep("关键词", txtls, value = TRUE)
  first <- regexpr("关键词", rawvec, ignore.case = FALSE)
  last <- regexpr("分类号", rawvec, ignore.case = FALSE)
  # skip the 3-character marker plus the colon, stop before "分类号"
  pkw <- append(pkw, substring(rawvec, first + 4, last - 2))
  close(fc)
}
# trim leading/trailing whitespace from every element
kwdict <- gsub("^[[:blank:]]+|[[:blank:]]+$", "", pkw)
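For readers who do not have the source files, here is a minimal sketch of the extraction step on a single hypothetical header line (the sample text is invented for illustration; the real files may differ):

```r
# a made-up line in the shape of the scanned paper headers
line <- "关键词: 本体 构建 语义 检索 分类号: G254"
first <- regexpr("关键词", line)   # character position of the keyword marker
last  <- regexpr("分类号", line)   # character position of the class-number marker
# skip the 3-character marker plus the colon, stop before "分类号"
kw <- substring(line, first + 4, last - 2)
# trim leading/trailing whitespace
kw <- gsub("^[[:blank:]]+|[[:blank:]]+$", "", kw)
print(kw)  # "本体 构建 语义 检索"
```

Note that regexpr returns character (not byte) positions on properly encoded UTF-8 strings, which is why the fixed offsets work for Chinese text.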
Below is an example of the multilingual kwdict, which I used to prepare a retrieval word list for building a matrix. I did not confirm what kind of object it was, and did not realize it was not actually a list until other scholars asked me about it. The character "list" looks like this:
[1] " 本体 构建 语义 协同 知识库 可视化 系统 架构 "
[2] ""
[3] " 知识 服务 知识 组织 体系 本体 语义 网 技术 "
[4] ""
[5] " 服务 接口 知识 组织 开放 查询 语义 推理 Web 服务 "
[6] " Solr 分面 搜索 标准 信息管理 "
[7] " 语义 W i k i 标注 导航 检索 S e m a n t i c M e d i a W i k i P A U X I k e W i k i "
[8] " Liferay 主从 模式 集成 知识 平台 "
[9] " 数据 摄取 SKE 本体 属性 映射 三元组 存储 "
[10] "本体 实体 检索 查询 问答 关联 检索 可视化"
I tried to use unlist to break the elements down into words:
expansion <- unlist(kwdict, recursive = FALSE)
But the result is identical to the input shown above, unchanged. Now I realize that unlist probably cannot split the strings, but I have no idea how else to do it. Could you recommend another method?
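To make the problem reproducible without my files, here is a small demonstration with invented elements, showing that unlist changes nothing because the object was never a list:

```r
# a tiny stand-in for kwdict: a plain character vector, not a list
kwdict <- c("本体 构建 语义 协同", "", "知识 服务 本体")
expansion <- unlist(kwdict, recursive = FALSE)
identical(expansion, kwdict)  # TRUE: unlist on a character vector is a no-op
is.list(kwdict)               # FALSE: it is an atomic vector
```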