I have been using WEKA to do some text classification work and I want to try out R.

The problem is that I cannot load the ARFF files produced by WEKA's StringToWordVector filter into Rattle.

Looking at the logs I get something like:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  scan() expected 'a real', got '2281}'

My ARFF data file looks a bit like this:

@relation 'reviewData'

@attribute polarity {0,2}
.....
@attribute $$ numeric
@attribute we numeric
@attribute wer numeric
@attribute win numeric
@attribute work numeric

@data
{0 2,63 1,71 1,100 1,112 1,140 1,186 1,228 1}
{14 1,40 1,48 1,52 1,61 1,146 1}
{2 1,41 1,43 1,57 1,71 1,79 1,106 1,108 1,133 1,146 1,149 1,158 1,201 1}
{0 2,6 1,25 1,29 1,42 1,49 1,69 1,82 1,108 1,116 1,138 1,140 1,155 1}
....

Any ideas on how I can convert this into an R-readable format?

Cheers!

Amro
NightWolf
  • Have you tried using the `read.arff` command? – joran Aug 04 '11 at 15:14
  • That's the `read.arff` function from the RWeka package. `install.packages("RWeka")` should install it. – Spacedman Aug 04 '11 at 15:51
  • @Spacedman - There's also one in `foreign`, and glancing at the source they don't appear exactly the same. I haven't used either, though, so I can't comment on which is preferable. – joran Aug 04 '11 at 15:55
  • You're right, Rattle seems to use the `read.arff` in the `foreign` package. Is there any way to force it to use the `read.arff` in RWeka? I tried loading the RWeka library and detaching `foreign`, but no luck: R just re-attaches `foreign`. – NightWolf Aug 05 '11 at 11:52

1 Answer

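When you save the result of the StringToWordVector attribute filter, it is written as a sparse ARFF file. In that format each data row lists only the non-zero entries, as `{index value, index value, ...}` pairs with attribute indices starting at 0; every attribute that is not listed has the value 0.

Check whether Rattle supports reading this format. If it does not, you can apply the SparseToNonSparse instance filter, which converts the data to a dense format (the file will be much larger).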

Example: if the sparse data looks like:

sparse.arff

@relation name
@attribute word1 numeric
@attribute word2 numeric
..
@attribute word10 numeric
@data
{0 1,3 3,8 1,9 1}
{2 2,5 1,8 1,9 1}

it will be converted to:

nonsparse.arff

@relation name
@attribute word1 numeric
@attribute word2 numeric
..
@attribute word10 numeric
@data
1,0,0,3,0,0,0,0,1,1
0,0,2,0,0,1,0,0,1,1
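
To make the mapping between the two files concrete, here is a minimal sketch in Python of how one sparse row expands into a dense one. The function name `sparse_row_to_dense` is made up for illustration and is not part of WEKA. The sketch assumes that values contain no commas or quoted strings.

```python
def sparse_row_to_dense(row, n_attributes):
    """Expand one sparse ARFF data row into a list of n_attributes strings.

    Hypothetical helper for illustration; assumes simple unquoted values.
    """
    dense = ["0"] * n_attributes        # attributes that are not listed are 0
    body = row.strip()[1:-1].strip()    # drop the surrounding braces
    if body:
        for pair in body.split(","):
            index, value = pair.split(None, 1)   # each pair is "<index> <value>"
            dense[int(index)] = value.strip()    # indices are 0-based
    return dense


# The two rows from sparse.arff above, with 10 attributes:
print(",".join(sparse_row_to_dense("{0 1,3 3,8 1,9 1}", 10)))  # 1,0,0,3,0,0,0,0,1,1
print(",".join(sparse_row_to_dense("{2 2,5 1,8 1,9 1}", 10)))  # 0,0,2,0,0,1,0,0,1,1
```

Every row grows to the full attribute count. This is why the dense file becomes so much larger when the vocabulary is big.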
Amro
  • Thanks, this works; however, it is very painful to load massive files. Any workaround? – NightWolf Aug 05 '11 at 11:51
  • @NightWolf: since the sparse ARFF format is simple enough, you could parse the file yourself in R and store it in a [sparse matrix](http://stackoverflow.com/questions/1167448/most-mature-sparse-matrix-package-for-r) – Amro Aug 06 '11 at 13:06
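
As a rough illustration of the "parse it yourself" idea, the sketch below is written in Python rather than R and works as a preprocessing step. It reads a sparse ARFF file into (row, column, value) triplets without ever building the dense matrix. The names `read_sparse_arff` and `write_triplets` and the sample data are hypothetical. It assumes that all values are numeric, that attribute names contain no spaces, and that every data row is wrapped in braces.

```python
import csv
import io


def read_sparse_arff(lines):
    """Parse sparse ARFF lines into (attribute_names, n_rows, triplets).

    triplets holds (row, column, value) tuples with 1-based indices,
    which is the indexing convention R uses. Hypothetical helper.
    """
    names, triplets = [], []
    in_data = False
    n_rows = 0
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                names.append(line.split(None, 2)[1])
            elif lower.startswith("@data"):
                in_data = True
            continue
        n_rows += 1
        body = line[1:-1].strip()                 # assumes "{...}" rows only
        if body:
            for pair in body.split(","):
                index, value = pair.split(None, 1)
                triplets.append((n_rows, int(index) + 1, float(value)))
    return names, n_rows, triplets


def write_triplets(triplets, handle):
    """Write the triplets as CSV with a header row i,j,x."""
    writer = csv.writer(handle)
    writer.writerow(["i", "j", "x"])
    writer.writerows(triplets)


# Made-up sample data, not taken from the question's file.
sample = """@relation name
@attribute word1 numeric
@attribute word2 numeric
@attribute word3 numeric
@data
{0 1,2 3}
{1 2}
""".splitlines()

names, n_rows, triplets = read_sparse_arff(sample)
out = io.StringIO()
write_triplets(triplets, out)
print(names, n_rows)
print(out.getvalue())
```

A triplet file like this stays roughly as small as the sparse ARFF file. On the R side it could presumably be loaded with `read.csv` and passed to `sparseMatrix(i, j, x, dims)` from the Matrix package, though that route is untested here.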