
Over the past few days, I have been building a keyword list from documents:

txtdir<-"E:/2015Ddtrial"
txtnames<-list.files(txtdir, pattern = "\\.txt$", all.files = FALSE,
                     recursive = TRUE, include.dirs = FALSE, full.names=TRUE)
# prepared for data input
pkw<-vector()
#extract the keywords of the papers in the text files
for(i in seq_along(txtnames)) {
  fc<-file(txtnames[i], open = "r", encoding = "UTF-8")
  #alternative method of reading files: txtls<-readLines(con = fc, n=6, encoding = "unknown", skipNul = TRUE)
  txtls<-scan(file = fc, what=character(), nmax = 6, sep = "\n", blank.lines.skip = TRUE, skipNul = TRUE, fileEncoding = "UTF-8")
  rawvec<-grep("关键词", txtls, value = TRUE)
  first<-regexpr("关键词", rawvec, ignore.case = FALSE)
  last<-regexpr("分类号", rawvec, ignore.case = FALSE)
  pkw<-append(pkw, substring(rawvec, first+4, last-2))
  close(fc)
}
#strip leading and trailing blanks from every element
kwdict<-gsub("^[[:blank:]]+|[[:blank:]]+$", "", pkw)

This is an example of the multilingual list (kwdict) that I use to prepare a retrieval word list for a matrix. I assumed it was a list without checking, until commenters asked me about it. The character "list" looks like this:

[1] "  本体 构建   语义 协同   知识库   可视化   系统 架构    "
[2] ""
[3] "  知识 服务   知识 组织 体系   本体   语义 网 技术    "
[4] ""
[5] "    服务 接口   知识 组织   开放 查询   语义 推理   Web   服务 "
[6] "    Solr   分面 搜索   标准 信息管理 "
[7] "  语义   W i k i   标注   导航   检索   S e m a n t i c M e d i a W i k i   P A U X   I k e W i k i    "
[8] "  Liferay   主从 模式   集成 知识 平台    "
[9] "    数据 摄取   SKE   本体   属性 映射   三元组 存储    "
[10] "本体   实体 检索   查询 问答   关联 检索   可视化"

I tried to use "unlist" to break the elements down into words:

expansion <- unlist(kwdict, recursive = FALSE)

But the result is the same as the input (listed above), unchanged. I now realize that "unlist" may not be able to split it, but I have no idea how else to do it. Could you recommend another method?

赵鸿丰
  • Please add the output of `dput(kwdict)` to your post. [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269) – zx8754 Mar 09 '17 at 08:43
  • I'm not sure that your variable is a `list`. Try `class(variable_name)`. – DJJ Mar 09 '17 at 08:52
  • Maybe `scan(text = kwdict, what = "")` ? – zx8754 Mar 09 '17 at 08:56
  • Possible duplicate of http://stackoverflow.com/questions/24741541/how-to-split-a-string-by-any-number-of-spaces and http://stackoverflow.com/questions/4350440/split-a-column-of-a-data-frame-to-multiple-columns – zx8754 Mar 09 '17 at 08:58
  • Thank you for your advice. I checked, and the class of "kwdict" is "character"! I used "scan", "regexpr" and "append" to build it, so how can that be? In any case, is there a method to split this pseudo-list into single words? – 赵鸿丰 Mar 09 '17 at 09:05
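
To illustrate the points raised in the comments, here is a minimal sketch. It assumes `kwdict` is a plain character vector whose elements contain whitespace-separated words; `kw_sample` is a hypothetical ASCII stand-in for the real data, not the actual `kwdict`. It shows why `unlist` returns the input unchanged and how `strsplit` (or `scan(text = ...)`, as suggested above) could split the elements instead.

```r
# Hypothetical stand-in for kwdict (ASCII only, so it runs in any locale)
kw_sample <- c("  Solr   faceted search   metadata ", "", "ontology   SKE   triple store  ")

class(kw_sample)                         # "character": an atomic vector, not a list
identical(unlist(kw_sample), kw_sample)  # TRUE: unlist() has nothing to flatten

# Split every element on runs of whitespace; strsplit() returns a list,
# and this is the point where unlist() becomes useful
pieces <- unlist(strsplit(kw_sample, "[[:space:]]+"))
words  <- pieces[nzchar(pieces)]         # drop the empty strings left by leading blanks

# The commenter's suggestion, which also splits on whitespace
words_scan <- scan(text = kw_sample, what = "", quiet = TRUE)

# Assumption: if runs of two or more blanks separate whole keywords,
# splitting on those runs keeps multi-word keywords together
phrases <- unlist(strsplit(trimws(kw_sample), "[[:space:]]{2,}"))
```

Whether single words or whole keyword phrases are the right unit depends on how the source files were segmented, so the last step is only a guess based on the spacing visible in the printed output.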
