
When I try to use the svmlight Python package with data I have already converted to svmlight format, I get an error. It should be pretty basic, but I don't understand what's happening. Here's the code:

import svmlight
training_data = open('thedata', "w")
model=svmlight.learn(training_data, type='classification', verbosity=0)

I've also tried:

training_data = numpy.load('thedata')

and

training_data = __import__('thedata')
mhawke
PF_learning

1 Answer


One obvious problem is that you are truncating your data file when you open it because you are specifying write mode "w". This means that there will be no data to read.
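The truncation can be seen in isolation with a minimal sketch. The temporary file and its single sample line are made up for illustration; nothing here touches your real data file.

```python
import os
import tempfile

# Create a throwaway file holding one SVMlight-style line (illustrative content).
path = os.path.join(tempfile.mkdtemp(), "thedata")
with open(path, "w") as f:
    f.write("-1 1:0.5 2:0.25\n")
size_before = os.path.getsize(path)

# Opening an existing file in "w" mode empties it immediately,
# even if nothing is ever written to the new handle.
with open(path, "w"):
    pass
size_after = os.path.getsize(path)

print(size_before, size_after)
```

Opening with `"r"` (or omitting the mode, since read is the default) leaves the contents intact.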

In any case, you don't need to read the file like that. If your data file is like the one in this example, it is a Python file, so you need to import it. This should work:

import svmlight
from data import train0 as training_data    # assuming your data file is named data.py
# or you could use __import__()
#training_data = __import__('data').train0

model = svmlight.learn(training_data, type='classification', verbosity=0)

You might want to compare your data against that of the example.

Edit after data file format clarified

The input file needs to be parsed into a list of tuples like this:

[(target, [(feature_1, value_1), (feature_2, value_2), ... (feature_n, value_n)]),
 (target, [(feature_1, value_1), (feature_2, value_2), ... (feature_n, value_n)]),
 ...
]
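As a concrete illustration of that layout, here is a small sketch with two made-up documents. The helper `looks_like_documents` is hypothetical (it is not part of the svmlight package) and only checks the shape of the data.

```python
# Two hypothetical training documents: (target, [(feature_id, value), ...])
training_data = [
    (1.0, [(1, 0.5), (3, 1.25)]),
    (-1.0, [(2, 0.75), (3, -0.5)]),
]


def looks_like_documents(docs):
    """Return True if docs has the list-of-tuples shape described above."""
    if not isinstance(docs, list):
        return False
    for doc in docs:
        if not (isinstance(doc, tuple) and len(doc) == 2):
            return False
        target, features = doc
        if not isinstance(target, (int, float)) or not isinstance(features, list):
            return False
        for pair in features:
            if not (isinstance(pair, tuple) and len(pair) == 2):
                return False
    return True


print(looks_like_documents(training_data))  # a list of (target, features) tuples
print(looks_like_documents("thedata"))      # a file name or file object is not
```

A check like this makes it clearer why passing an open file handle or a file name does not give the function the list of documents it expects.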

The svmlight package does not appear to support reading from a file in the SVMlight file format, and it has no parsing functions, so the parsing has to be implemented in Python. Files in that format look like this:

<target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>

so here is a parser that converts from the file format to that required by the svmlight package:

def svm_parse(filename):
    """Yield (target, [(feature, value), ...]) for each line of an SVM format file"""

    def _convert(t):
        """Convert feature and value to appropriate types"""
        return (int(t[0]), float(t[1]))

    with open(filename) as f:
        for line in f:
            line = line.split('#')[0].strip()  # remove any comment
            if not line:
                continue  # skip blank and comment-only lines
            data = line.split()
            target = float(data[0])
            features = [_convert(feature.split(':')) for feature in data[1:]]
            yield (target, features)

And you can use it like this:

import svmlight

training_data = list(svm_parse('thedata'))
model = svmlight.learn(training_data, type='classification', verbosity=0)
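One thing that may be worth checking is feature numbering. The instance quoted in the comments below starts at feature `0`, which is what scikit-learn's `dump_svmlight_file` writes by default (`zero_based=True`), whereas SVMlight's own file format numbers features from 1. Whether this affects the Python package is an assumption, not a confirmed diagnosis. The sketch below uses a hypothetical helper, `to_one_based`, that shifts the ids only when a `0` is present; the sample document is the first three features of the quoted instance.

```python
def to_one_based(documents):
    """Return a copy of documents with every feature id shifted up by one,
    but only if some feature id is 0 (i.e. the data looks zero-based)."""
    zero_based = any(fid == 0 for _, features in documents for fid, _ in features)
    if not zero_based:
        return list(documents)
    return [(target, [(fid + 1, value) for fid, value in features])
            for target, features in documents]


# First three features of the instance quoted in the comments (zero-based ids).
parsed = [(-1.0, [(0, 1.173286269861675),
                  (1, 0.4524566925178124),
                  (2, -0.9622895995173304)])]

shifted = to_one_based(parsed)
print(shifted[0][1][0])  # first (feature, value) pair after shifting
```

If the shift is needed, it has to be applied to the training and test documents alike, so that the same feature keeps the same id in both.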
mhawke
  • When I try the open with "r" I get `TypeError: expected list of documents` . When I try the import I get `-1 0:1.173286269861675 1:0.4524566925178124 2:-0.9622895995173304 3:-0.0323228512901514 4:-0.3883630237637885 5:0.05964858946340369 6:0.4476052015809368 7:0.4476052012718441 8:0.03136114112311881 9:0.4500600446286898 10:0.4492788390876735 11:0.4479268098079717 12:0.4475026647089226 13:0.4479050146233448 14:0.4476815332854934 15:0.4474691649775809 16:0.4565717543476677 17:0.4475921191001453 ^ SyntaxError: invalid syntax` (the above is an example of an instance in my data file) – PF_learning Sep 04 '14 at 13:31
  • The data file is in fact not exactly the same, because I don't have structs, just the data. I used scikit-learn's array-to-svmlight format conversion function, and I'm not sure how I can turn the data into a file with a struct instead... – PF_learning Sep 04 '14 at 13:33
  • Thank you. I've accepted your answer; you helped me see the problem and you solved it. However, I always get zero when testing the constructed SVM model with the resulting data. I also tried parsing it myself just to be sure, and still got zero every time when testing the model. I am not sure if both our data conversions are still wrong or if there's a problem with the implementation I'm using, but I suspect it's something else, so I guess this issue is closed. I'll update if I find out why I always get zero (I've tried the same data on other algorithms and the results are good). – PF_learning Sep 07 '14 at 19:46
  • The data conversion is correct - it works on both test files available at http://download.joachims.org/svm_light/examples/example1.tar.gz. I also don't think that there is a problem with the implementation as the model file written by `svmlight.write_model()` contains data. Have you tested with [`simple.py`](https://bitbucket.org/wcauchois/pysvmlight/src/d8f3bb76d016fbab8d01f53b0bf84560bbbe6e05/examples/simple.py)? – mhawke Sep 08 '14 at 03:52
  • Yes, simple.py works as expected. It only fails with this data, and I really don't know why. LIBSVM and LIBLINEAR work fine. I also tried changing `target = float(data[0])` to `target = int(data[0])`, in case it only accepts integers, but the results stay the same: all targets are 0 when trying to classify. I just noticed that the [data in the examples](https://bitbucket.org/wcauchois/pysvmlight/src/d8f3bb76d016fbab8d01f53b0bf84560bbbe6e05/examples/data.py?at=default) has targets of 1 and -1 in the train data, and only 0 in the test data, which doesn't even make sense. – PF_learning Sep 08 '14 at 13:08
  • I have uploaded the data, which I think is fine, but just in case you want to test it: [training data](http://bitshare.com/files/sk0ozbxq/poydf_train.html) [test data](http://bitshare.com/files/y5wnar34/poydf_test.html) – PF_learning Sep 08 '14 at 13:16