I need to compute the percentage of Japanese characters in every sentence in R. I split the text into sentences, and it looks like this:

> text  
[1]  "若い人が仕事がつまらない会社が面白くないというのはなぜか"
[2]  "それは要するに自分のやることを人が与えてくれると思っているからです"
[3]  "でも会社が自分にあった仕事をくれるわけではありません"

I want to get the number of hiragana characters in each sentence. I have a text file listing the hiragana characters to search for. I can do it for a single sentence but can't apply it to all sentences. For one sentence I do it like this:

> hiragana<-scan("hiragana.txt",what="char")
> hiragana<-unlist(strsplit(hiragana,"")) #hiragana list to search in sentences
> b<-text[1]
> b<-unlist(strsplit(b,"")) # so that I can search characters in the sentence
> b
[1] "若" "い" "人" "が" "仕" "事" "が" "つ" "ま" "ら" "な" "い" "会" "社"
[15] "が" "面" "白" "く" "な" "い" "と" "い" "う" "の" "は" "な" "ぜ" "か"
> b[(b %in% hiragana)]
[1] "い" "が" "が" "つ" "ま" "ら" "な" "い" "が" "く" "な" "い" "と" "い"
[15] "う" "の" "は" "な" "ぜ" "か"
> length(b[(b %in% hiragana)])
[1] 20

My question is how I can make this work for more than one sentence. I need output like this:

> output
[1]  20
[2]  28
[3]  20

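My problem is similar to [this](http://stackoverflow.com/questions/14928326/r-count-matches-between-characters-of-one-string-and-another-no-replacement), but I want to apply it to each sentence, not a specific one.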

Any opinions?

  • Can you provide an actual example of input and expected output? This may be as simple as using `nchar()`, but I don't know Japanese or quite understand what you're looking for. – David Oct 23 '13 at 16:26
  • In addition to your existing code, please add a sample of your data to the question, using [dput](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) or similar. – SlowLearner Oct 23 '13 at 22:22
  • `nchar` works fine with Japanese: `nchar( "あいうえお" )` correctly returns 5. – Vincent Zoonekynd Oct 23 '13 at 23:48
  • What is the structure of your file? Does each line contain exactly one sentence? Can sentences split across multiple lines? Does the text flow left to right, top to bottom, or top-bottom, right-left? – Scott Ritchie Oct 26 '13 at 03:22
  • @Manetheran Each line contains one sentence. Text flows from left to right, top to bottom. My problem is similar to [this](http://stackoverflow.com/questions/14928326/r-count-matches-between-characters-of-one-string-and-another-no-replacement), but I want to apply it to each sentence, not a specific one. – user2887321 Oct 26 '13 at 07:22

2 Answers

1

Thank you for the answers, everybody. I found out how to solve my problem; it was easier than I thought. Here's the solution:

text<-readLines(filename) #filename is the path to the text file
text<-unlist(strsplit(text, "。")) #splits text into sentences
#deletes everything that is not hiragana, then counts what is left in each sentence
#(for katakana, replace the ぁ-ん range with the katakana one)
nchar(gsub("[^ぁ-ん]","",text))
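The same regex idea can also be written so that it extracts the matches instead of deleting characters. The following is a minimal sketch, assuming a UTF-8 session; `count_hiragana` is a hypothetical helper name, and the range U+3041 to U+3096 is the Unicode hiragana block, which is slightly wider than ぁ-ん.

```r
# Hypothetical helper: count hiragana characters in each element of x.
count_hiragana <- function(x) {
  # Locate every character in the Unicode hiragana block (U+3041..U+3096).
  matches <- gregexpr("[\u3041-\u3096]", x)
  # regmatches() returns one vector of matched characters per sentence;
  # lengths() turns that list into one count per sentence.
  lengths(regmatches(x, matches))
}

# The three sentences from the question.
text <- c("若い人が仕事がつまらない会社が面白くないというのはなぜか",
          "それは要するに自分のやることを人が与えてくれると思っているからです",
          "でも会社が自分にあった仕事をくれるわけではありません")

counts <- count_hiragana(text)
print(counts)
# Share of hiragana in each sentence, since the question asks for percentages.
print(counts / nchar(text))
```

Because `gregexpr` and `regmatches` are vectorized over `x`, no explicit loop is needed; a sentence with no hiragana simply yields a count of 0.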
0

Using `readLines`, you can simply wrap your code in a for loop that reads your file line by line (one sentence per line):

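path <- "hiragana.txt"  # file to read, one sentence per line
conn <- file(path, "rt")
nLines <- length(readLines(path, warn=FALSE))  # number of lines in the file
output <- rep(0, nLines)
for (i in seq_len(nLines)) {
  line <- readLines(conn, n=1, warn=FALSE)
  chars <- strsplit(line, "")[[1]]  # split the sentence into single characters
  # `hiragana` is the lookup vector built in the question
  nHiragana <- sum(chars %in% hiragana, na.rm=TRUE)
  output[i] <- nHiragana
}
close(conn)
output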
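If the sentences are already in a character vector, as in the question, the loop can be replaced by applying the `%in%` lookup to each sentence. This is only a sketch under assumptions: `count_in_set` and `hiragana_chars` are hypothetical names, and the lookup table is generated from the Unicode hiragana block (U+3041 to U+3096) rather than read from `hiragana.txt`.

```r
# Assumption: build the lookup table from the Unicode hiragana block
# instead of reading it from hiragana.txt.
hiragana_chars <- intToUtf8(0x3041:0x3096, multiple = TRUE)

# Hypothetical helper: for each sentence, count characters found in charset.
count_in_set <- function(sentences, charset) {
  # strsplit(x, "") splits every sentence into single characters,
  # giving a list with one character vector per sentence.
  split_chars <- strsplit(sentences, "")
  vapply(split_chars,
         function(chars) sum(chars %in% charset),
         integer(1))
}

# The three sentences from the question.
text <- c("若い人が仕事がつまらない会社が面白くないというのはなぜか",
          "それは要するに自分のやることを人が与えてくれると思っているからです",
          "でも会社が自分にあった仕事をくれるわけではありません")

output <- count_in_set(text, hiragana_chars)
print(output)
```

Passing the character set as an argument means the same function can count katakana or any other set by swapping in a different lookup vector.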
Scott Ritchie