I am working with a data set with ordinal variables as well as a column with text. In general, I would like to add columns that are results of a text mining exercise, maintaining the table structure.

For example, I have imported the CSV file data-subset.csv into a data frame called datacsv:

datacsv=read.csv("data-subset.csv", header=TRUE,sep=";")

The third column, tekst, contains text. I would like to search that text for numbers (usually between 0 and 1) that appear in the context of "fte" and add them as a column fte. See:

>  luid  titel              tekst
>1 47300 docent wiskunde    De Stichting Openbaar Voortgezet Onderwijs 0,65 fte voltijd niveau: havo vwo
>2 43701 docent natuurkunde Speciaal onderwijs fulltime 2015 2016 fte 0,77 Haarlem
>3 43702 assistent          basisonderwijs Amsterdam fte 0,5

I have installed packages such as tm and quanteda:

install.packages(c("tm", "quanteda"))
library("tm")
library("quanteda")

I have tried various kwic statements, without satisfactory results, such as

datacsv["fte"] <- kwic(datacsv$tekst, "fte", 4)

Does anyone know how to mine the text column and add the results as a column (or multiple columns)?

Thanks!

  • So it has strings with numbers and you want to extract the numbers? You should include a reproducible example in your question. Have a look at [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Sotos Mar 31 '16 at 14:44
  • Welcome to Stack Overflow! Please provide a [mcve] – Steven Beaupré Mar 31 '16 at 14:47
  • Thanks to both of you. I have edited my question. Hopefully it is more usable now. – Yannick Bleeker Mar 31 '16 at 15:15

2 Answers

This?

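library(stringr)
# Take the text after "fte " and extract the numbers in it, allowing a decimal comma or point
datacsv$fte <- str_extract_all(sapply(strsplit(as.character(datacsv$tekst), "fte "), "[", 2), '\\d+[.,]?\\d*')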
Sotos
  • This is in the right direction. However, the problem I currently face is that other cases contain multiple numbers. For example, the variable `tekst` can also contain a year. This means it is essential that R extracts only those numbers that are located around the term 'fte'. Please also note that the position of `fte` varies between cases. – Yannick Bleeker Apr 01 '16 at 10:27
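
The requirement in this comment (only numbers located around 'fte', wherever it appears) could be approached roughly as follows in base R. This is only a sketch: `extract_fte` and its `max_fte` argument are hypothetical names, not functions from tm or quanteda. It assumes the value sits directly before or directly after the word `fte`, is written with a decimal comma or point, and lies between 0 and `max_fte`, so that a neighbouring year such as 2016 is discarded.

```r
# Hypothetical helper: number directly before or after the word "fte"
extract_fte <- function(text, max_fte = 1) {
  text <- as.character(text)  # read.csv() may have returned a factor
  number <- "[0-9]+(?:[.,][0-9]+)?"
  find_all <- function(pattern) {
    regmatches(text, gregexpr(pattern, text, perl = TRUE, ignore.case = TRUE))
  }
  # Two separate passes, so that "2016 fte 0,77" yields both candidates
  after  <- find_all(paste0("\\bfte\\s+", number))
  before <- find_all(paste0("\\b", number, "\\s+fte\\b"))
  pick <- function(a, b) {
    # Drop the keyword and spaces, then turn a decimal comma into a point
    candidates <- gsub("[^0-9.,]", "", c(a, b))
    values <- as.numeric(sub(",", ".", candidates, fixed = TRUE))
    # Assumption: a plausible fte value lies between 0 and max_fte
    values <- values[!is.na(values) & values >= 0 & values <= max_fte]
    if (length(values) > 0) values[1] else NA_real_
  }
  unname(mapply(pick, after, before))
}

# Sample rows copied from the question
datacsv <- data.frame(
  luid = c(47300, 43701, 43702),
  titel = c("docent wiskunde", "docent natuurkunde", "assistent"),
  tekst = c(
    "De Stichting Openbaar Voortgezet Onderwijs 0,65 fte voltijd niveau: havo vwo",
    "Speciaal onderwijs fulltime 2015 2016 fte 0,77 Haarlem",
    "basisonderwijs Amsterdam fte 0,5 "
  ),
  stringsAsFactors = FALSE
)
datacsv$fte <- extract_fte(datacsv$tekst)
print(datacsv[, c("luid", "titel", "fte")])
```

For the three sample rows this gives 0.65, 0.77 and 0.5; rows without a plausible value next to `fte` get `NA`, so the column always has one entry per row. If values above 1 can occur, `max_fte` would need to be raised, at the risk of picking up other small numbers.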
Maybe this works?

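    library(dplyr)
    # Match the first decimal number (comma or point) and convert it to numeric
    mutate(datacsv,
           fte = as.numeric(sub(",", ".",
                                regmatches(tekst, regexpr("[[:digit:]]+[.,][[:digit:]]+", tekst)),
                                fixed = TRUE)))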
denise