Adding columns that are results of (contextual) text mining operations

Question

I am working with a data set with ordinal variables as well as a column with text. In general, I would like to add columns that are results of a text mining exercise, maintaining the table structure.

For example, I have imported a CSV file, data-subset.csv, and obtained a data frame called datacsv:

datacsv=read.csv("data-subset.csv", header=TRUE,sep=";")

The third column tekst contains text. I would like to search for numbers in that text (that will regularly lie between 0 and 1) in the context of "fte" and add these numbers as column fte. See:

>   luid  titel              tekst
> 1 47300 docent wiskunde    De Stichting Openbaar Voortgezet Onderwijs 0,65 fte voltijd niveau: havo vwo
> 2 43701 docent natuurkunde Speciaal onderwijs fulltime 2015 2016 fte 0,77 Haarlem
> 3 43702 assistent          basisonderwijs Amsterdam fte 0,5

I have installed packages such as tm and quanteda:

install.packages(c("tm", "quanteda"))
library("tm")
library("quanteda")

I have tried various kwic statements, without satisfactory results, such as

datacsv ["fte"]<- kwic(datacsv$"tekst", "fte", 4)

Does anyone know how to mine the text column and add the results as a column (or multiple columns)?

Thanks!

So it has strings with numbers and you want to extract the numbers? You should include a reproducible example in your question. Have a look at [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Sotos, Mar 31 '16 at 14:44
Thanks to both of you. I have edited my question. Hopefully it is more usable now. — Yannick Bleeker, Mar 31 '16 at 15:15

Sotos · Answer 1 · 2016-04-01T10:59:45.513

0

This?

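library(stringr)
# as.character() guards against tekst being a factor; [.,] also accepts decimal commas such as "0,77"
datacsv$fte <- str_extract_all(sapply(strsplit(as.character(datacsv$tekst), "fte "), "[", 2), '\\d+[.,]?\\d*')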

edited Apr 01 '16 at 10:59

answered Mar 31 '16 at 14:52

Sotos


This is in the right direction. However, the problem I face currently is that other cases contain multiple numbers. So, for example, the variable `tekst` also contains a year. This means that it is essential that R only extracts those numbers that are located around the term 'fte'. Please also see that `fte` is located at random positions between cases. – Yannick Bleeker Apr 01 '16 at 10:27
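One way to address the points raised in this comment is to look only at a number that sits directly next to "fte", on either side, and to discard values outside a plausible range so that a year such as 2016 is not picked up. The following is a minimal base-R sketch under stated assumptions, not a definitive solution: the helper names `number_near` and `extract_fte` and the `max_fte` parameter are made up for illustration, the plausible range of 0 to 1 is taken from the question, the sample rows are modelled on the question's data, and only the first occurrence of "fte" per row is considered.

```r
# Sample rows modelled on the question's data; stringsAsFactors = FALSE keeps tekst as character.
datacsv <- data.frame(
  luid  = c(47300, 43701, 43702),
  titel = c("docent wiskunde", "docent natuurkunde", "assistent"),
  tekst = c(
    "De Stichting Openbaar Voortgezet Onderwijs 0,65 fte voltijd niveau: havo vwo",
    "Speciaal onderwijs fulltime 2015 2016 fte 0,77 Haarlem",
    "basisonderwijs Amsterdam fte 0,5"
  ),
  stringsAsFactors = FALSE
)

# A number with an optional decimal part; both "," and "." are accepted as separator.
num <- "[0-9]+(?:[.,][0-9]+)?"

# Hypothetical helper: the number found by `pattern` in each element of x, or NA.
number_near <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE, ignore.case = TRUE)
  hit <- rep(NA_character_, length(x))
  ok <- !is.na(m) & m > 0
  hit[ok] <- regmatches(x, m)            # e.g. "fte 0,77" or "0,65 fte"
  value <- gsub("[^0-9.,]", "", hit)     # drop "fte" and whitespace, keep the number
  as.numeric(sub(",", ".", value, fixed = TRUE))
}

# Hypothetical helper: prefer the number right after "fte", else the one right
# before it, and accept a value only if it lies in [0, max_fte] (assumed range).
extract_fte <- function(x, max_fte = 1) {
  x <- as.character(x)
  after  <- number_near(x, paste0("\\bfte\\s*", num))
  before <- number_near(x, paste0(num, "\\s*fte\\b"))
  plausible <- function(v) !is.na(v) & v >= 0 & v <= max_fte
  ifelse(plausible(after), after, ifelse(plausible(before), before, NA_real_))
}

datacsv$fte <- extract_fte(datacsv$tekst)
print(datacsv[, c("luid", "titel", "fte")])
```

With the three sample rows this gives 0.65, 0.77 and 0.5. Rows without a plausible number next to "fte" get `NA`, so the new column always has one value per row. If part-time factors above 1 can occur in the real data, `max_fte` would need to be raised accordingly.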

score 0 · Answer 2 · answered Mar 31 '16 at 14:56


Maybe this works?

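    library(dplyr)
    # Take the first decimal number in each tekst (decimal comma or point) and convert it to numeric.
    # Caveat: regmatches() drops rows without a match, so this assumes every row contains one.
    mutate(datacsv,
           fte = as.numeric(sub(",", ".",
                                regmatches(tekst, regexpr("[[:digit:]]+[.,][[:digit:]]+", tekst)),
                                fixed = TRUE)))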
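A caveat with combining `regmatches()` and `regexpr()`: the result contains only the elements that matched, so a single row without a decimal number makes it shorter than the data frame. A minimal base-R sketch of one way to keep the rows aligned, assuming unmatched rows should become `NA` (the helper name `first_decimal` and the example strings are made up for illustration):

```r
# Hypothetical helper: first decimal number per element, NA where there is none.
first_decimal <- function(x) {
  x <- as.character(x)
  m <- regexpr("[[:digit:]]+[.,][[:digit:]]+", x)
  out <- rep(NA_character_, length(x))
  ok <- !is.na(m) & m > 0
  out[ok] <- regmatches(x, m)   # regmatches() returns the matched elements only
  as.numeric(sub(",", ".", out, fixed = TRUE))
}

tekst <- c("fte 0,5", "geen getal", "2016 fte 0.77")
res <- first_decimal(tekst)
print(res)
```

The result has the same length as the input, so it can be assigned as a column directly, for example `datacsv$fte <- first_decimal(datacsv$tekst)`.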


denise

