
I have trained a random forest classifier on a very small dataset. There is only one feature, 'Position', and the target is 'Relevance'. My code is short and simple and can be found here: https://github.com/sakshamyadav/ocm_test/blob/master/Untitled.ipynb

What I want to do now is the following:

  • Read in any CSV file that has a 'Position' column
  • Run it through my trained random forest model to determine which rows are relevant and which aren't (1 or 0)
  • Remove all the rows where the predicted Relevance is 0
  • Save the result as a CSV

I would also appreciate any feedback or suggestions on my method. I am very new to machine learning and would like to know whether there is an easier way to achieve this task or whether my approach can be improved. Thanks very much in advance :)

P.S. The example dataset I provided in my Jupyter notebook is completely random; I don't mean to put down any profession.

Programmer

1 Answer


Assuming the variable names from your code:

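import pandas as pd
df = pd.read_csv('file_name.csv')
# predict() expects a 2-D input, so select the column with double brackets
df = df[rfc.predict(df[['Position']]) != 0]
df.to_csv('new_clean_file.csv', index=False)  # index=False leaves out the row index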
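For reference, here is a self-contained sketch of the same read, filter, save steps that can be run without the notebook. The training data, the helper name `keep_relevant`, and the in-memory buffers are all made up for illustration, and it assumes 'Position' holds numbers; swap in your own `rfc` and real file paths.

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data standing in for the notebook's DataFrame.
train = pd.DataFrame({
    "Position": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Relevance": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})
rfc = RandomForestClassifier(n_estimators=50, random_state=0)
rfc.fit(train[["Position"]], train["Relevance"])


def keep_relevant(model, source, destination):
    """Read a CSV, keep the rows the model labels non-zero, write the result."""
    df = pd.read_csv(source)
    # Double brackets keep a 2-D frame, which scikit-learn estimators expect.
    predictions = model.predict(df[["Position"]])
    kept = df[predictions != 0]
    kept.to_csv(destination, index=False)
    return kept


# In-memory buffers stand in for real file paths (assumption for this demo).
incoming = io.StringIO("Position\n2\n9\n4\n7\n")
outgoing = io.StringIO()
kept = keep_relevant(rfc, incoming, outgoing)
```

With real files you would call something like `keep_relevant(rfc, 'file_name.csv', 'new_clean_file.csv')` instead of passing buffers.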
RafaelLopes
  • Hi Rafael! All the positions in `file_name.csv` are strings though, so I get the error `ValueError: could not convert string to float: 'Director Marketing, Communications & Online`. Do I have to somehow convert those to numbers first or something? – Programmer Sep 13 '17 at 15:35
  • Yes, convert it with `pd.to_numeric(df['Position'], errors='coerce')`; see https://stackoverflow.com/questions/42719749/pandas-convert-string-to-int – RafaelLopes Sep 13 '17 at 20:51
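
One caveat on the conversion: `pd.to_numeric(..., errors='coerce')` turns every string it cannot parse into NaN, so free-text job titles end up as missing values rather than usable features. A possible alternative, sketched below under the assumption that 'Position' holds job titles, is to encode the text and train the forest inside one scikit-learn pipeline so the same encoding is applied at prediction time. The titles and labels here are invented, and bag-of-words via `CountVectorizer` is just one encoding choice, not necessarily what the notebook uses.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical labelled job titles; the real ones live in the notebook.
train = pd.DataFrame({
    "Position": [
        "Director Marketing", "Marketing Manager", "Head of Marketing",
        "Warehouse Clerk", "Delivery Driver", "Night Porter",
    ],
    "Relevance": [1, 1, 1, 0, 0, 0],
})

# Coercing text to numbers gives NaN for every title, which is not useful.
numeric = pd.to_numeric(train["Position"], errors="coerce")

# Bag-of-words encoding followed by the forest, fitted as a single model.
model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(train["Position"], train["Relevance"])

# New rows are encoded by the same fitted vectorizer before prediction.
new = pd.DataFrame({"Position": ["Marketing Director", "Forklift Driver"]})
relevant = new[model.predict(new["Position"]) != 0]
```

Here the pipeline takes the 1-D column of strings directly, because the vectorizer produces the 2-D feature matrix the forest needs.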