-5

I want to create a regex function that takes the codes and uses them as a reference dictionary to parse the corpus, then puts them into a TDM (term-document matrix) with their occurrences.

library(tm)
corpus <- Corpus(DirSource(path))
dictionary <- regexpr("", corpus)
regular <- DocumentTermMatrix(corpus, control = list(dictionary = dictionary))

Can anyone help me resolve this problem?

2 Answers

1

You could use this regex to extract integers from 10000 to 600000 (note that the `\d{5}` branch also accepts five-digit strings with leading zeros, such as 01234):

\b(?:[1-5]?\d{5}|600000)\b
Andie2302
  • I want a solution for how to apply a regex function in a TDM dictionary –  Aug 05 '18 at 18:30
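
As a rough illustration of how the pattern from this answer could be applied in R, here is a minimal sketch using base R only. The sample strings in `docs` are hypothetical stand-ins for the corpus text. The backslashes are doubled because the pattern is an R string literal, and `perl = TRUE` is used so that the PCRE syntax `(?:...)` and `\d` is handled. The resulting character vector is only a candidate for the `dictionary` control option used in the question; whether it fits the actual corpus is an assumption.

```r
# Hypothetical sample documents standing in for the corpus text.
docs <- c("order 12345 and 600000", "codes 01234 599999 600001")

# Same pattern as in the answer; backslashes are doubled for an R string.
pattern <- "\\b(?:[1-5]?\\d{5}|600000)\\b"

# perl = TRUE selects the PCRE engine, which supports (?:...) and \d.
matches <- regmatches(docs, gregexpr(pattern, docs, perl = TRUE))

# Unique codes found across all documents, sorted in C-locale order.
# "01234" is included because the five-digit branch allows leading zeros;
# "600001" is excluded because it is above the upper bound.
dictionary <- sort(unique(unlist(matches)), method = "radix")
print(dictionary)
```

If leading zeros should be rejected, the five-digit branch can be written as `[1-9]\d{4}` instead.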
0

I don't know much more of what you have or want, so does this help?

> txt <- c("asdlfk 9182 18273 sadfjk 182736 600001 aslkdfj", "091828 101922 foo 600000")
> gr <- gregexpr("\\b([1-9][0-9]{4}|[1-5][0-9]{5}|600000)\\b", txt)
> regmatches(txt, gr)
[[1]]
[1] "18273"  "182736"

[[2]]
[1] "101922" "600000"

> unlist(regmatches(txt, gr))
[1] "18273"  "182736" "101922" "600000"
r2evans
  • This doesn't extract numbers from `60000` to `99999` – Toto Aug 05 '18 at 08:52
  • 1
    @ala, which is it, 10000 or 100000? `matrix` output? Yeah, you're going to have to explain yourself with expected output given some provided input (perhaps mine, or perhaps you can give a [reproducible question](https://stackoverflow.com/questions/5963269/) with sample data, code attempted, and expected output). ("Frequencies" is trivial with `table(unlist(regmatches(txt,gr)))`.) – r2evans Aug 05 '18 at 19:54
  • Still unclear to me: 1e4 or 1e5? I still don't have your data, so though it gives you an error, it does not for me. Unfortunately, I'm not familiar with `tdm` and not in a position to research it enough to come up with my own sample data that might come close to yours. Since I know nothing about `dic.txt`, I don't know `VCorpus`, `DirSource`, or `DocumentTermMatrix` (functions not found for me), and your code has typos (`reggular`), I'm unable to do anything other than what I've provided so far. Want better help? Please make this question *fully* reproducible. – r2evans Aug 05 '18 at 21:04
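
Since the question was never made reproducible, the following is only a sketch of how the frequency idea from the comments (`table(unlist(regmatches(txt, gr)))`) might be extended to a per-document count matrix, which is roughly what a document-term matrix restricted to the matched codes would contain. It uses base R only. The sample text, the document names `doc1` and `doc2`, and the variable names are assumptions made for illustration and do not come from the asker's data.

```r
# Hypothetical sample documents; names are illustrative only.
txt <- c(doc1 = "asdlfk 9182 18273 sadfjk 182736 18273",
         doc2 = "091828 101922 foo 600000")

# Integers from 10000 to 600000 without leading zeros.
pattern <- "\\b([1-9][0-9]{4}|[1-5][0-9]{5}|600000)\\b"

# One character vector of matched codes per document.
codes <- regmatches(txt, gregexpr(pattern, txt))

# Overall frequency of each code across all documents.
overall <- table(unlist(codes))

# Fixed vocabulary so every document gets the same columns.
vocab <- sort(unique(unlist(codes)), method = "radix")

# Count each vocabulary term per document; vapply yields terms x docs,
# so transpose to get one row per document.
counts <- t(vapply(
  codes,
  function(x) as.integer(table(factor(x, levels = vocab))),
  integer(length(vocab))
))
dimnames(counts) <- list(Docs = names(txt), Terms = vocab)
print(counts)
```

Using `factor(x, levels = vocab)` keeps zero counts for codes that do not occur in a given document, so all rows line up on the same terms.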