Extract unknown words from a recurrent pattern

Question

I have a dataframe, called df, with articles from a major Spanish newspaper, La Vanguardia. I want to extract the category each article belongs to. Looking carefully at the links, I noticed a recurrent pattern; here is an example:

status_link
 www.lavanguardia.com/de-moda/moda/infantil/20150928/54437701110/freya-fossaceco-it-girl-8-meses.html
 www.lavanguardia.com/local/barcelona/20150927/54437681718/resultados-elecciones-barcelona.html
 www.lavanguardia.com/ciencia/20150928/54436850805/metastasis-cerebro-perfil-genetico-tratamiento-vall-d-hebron.html
 videos.lavanguardia.com/politica/elecciones-catalanas/20150928/54437688908/gobierno-debe-reaccionar-antes-elecciones-generales.html
 www.lavanguardia.com/economia/20150928/54436887975/audi-millones-coche-dieselgate.html
 www.lavanguardia.com/vida/20150928/54437702392/claves-sobreexponer-hijos-internet.html
 www.lavanguardia.com/ciencia/20150928/54437643626/superluna-roja-de-sangre.html

Notice that the first word after the first slash corresponds to the category (in Spanish): de-moda (fashion), local, ciencia (science), economia (economics), vida (life), etc.

www.lavanguardia.com/de-moda/moda/infantil/20150928/54437701110/freya-fossaceco-it-girl-8-meses.html
www.lavanguardia.com/local/barcelona/20150927/54437681718/resultados-elecciones-barcelona.html

I'd like to extract each of these words and create a new variable next to each link, assigning each link its corresponding category. I don't know how many categories there might be, but the pattern is quite consistent. Notice also that the host sometimes varies (see the 4th link above), but the structure stays the same.

I have tried everything suggested here, here and here, but it's not clear to me how to use regex (or another package that might be useful) in this case.

I'd really appreciate any suggestion!

EDIT

I uploaded here the data so you can work with it. The column I am interested in is: status_link

https://www.dropbox.com/s/dot6iq9zhicxh1e/LaVanguardia_facebook_statuses.csv?dl=0

Why not provide your data in an easy-to-paste form? – Roman Luštrik, Jun 11 '16 at 17:06
or without `regex`: `strsplit(status_link,"/")[[1]][2]` – mtoto, Jun 11 '16 at 17:10
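
A vectorized sketch of the `strsplit` idea from the comment above. As written, the one-liner only inspects the first link (`[[1]]`), so `vapply` is used here to take the second piece of every link. The vector `links` is a hypothetical stand-in for `df$status_link`, and the sketch assumes the links carry no `http://` prefix.

```r
# Hypothetical sample standing in for df$status_link (no scheme prefix assumed)
links <- c(
  "www.lavanguardia.com/de-moda/moda/infantil/20150928/54437701110/freya-fossaceco-it-girl-8-meses.html",
  "videos.lavanguardia.com/politica/elecciones-catalanas/20150928/54437688908/gobierno-debe-reaccionar-antes-elecciones-generales.html"
)

# Split each link on "/"; piece 1 is the host, piece 2 is the category
parts <- strsplit(links, "/", fixed = TRUE)
category <- vapply(parts, function(p) p[2], character(1))

print(category)
```

A link without any slash would yield `NA` here, which may be easier to spot than a silently wrong value.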

rock321987 · Accepted Answer · 2016-06-12T09:22:28.627

You can try this (regex approach):

> x <- c("www.lavanguardia.com/de-moda/moda/infantil/20150928/54437701110/freya-fossaceco-it-girl-8-meses.html","www.lavanguardia.com/local/barcelona/20150927/54437681718/resultados-elecciones-barcelona.html","www.lavanguardia.com/ciencia/20150928/54436850805/metastasis-cerebro-perfil-genetico-tratamiento-vall-d-hebron.html","videos.lavanguardia.com/politica/elecciones-catalanas/20150928/54437688908/gobierno-debe-reaccionar-antes-elecciones-generales.html","www.lavanguardia.com/economia/20150928/54436887975/audi-millones-coche-dieselgate.html","www.lavanguardia.com/vida/20150928/54437702392/claves-sobreexponer-hijos-internet.html","www.lavanguardia.com/ciencia/20150928/54437643626/superluna-roja-de-sangre.html")
> y = sub("^[^\\/]+\\/([^\\/]+).*$", "\\1", x)
> data.frame(x, y)
                                                                                                                                    x        y
#1                                www.lavanguardia.com/de-moda/moda/infantil/20150928/54437701110/freya-fossaceco-it-girl-8-meses.html  de-moda
#2                                      www.lavanguardia.com/local/barcelona/20150927/54437681718/resultados-elecciones-barcelona.html    local
#3                 www.lavanguardia.com/ciencia/20150928/54436850805/metastasis-cerebro-perfil-genetico-tratamiento-vall-d-hebron.html  ciencia
#4 videos.lavanguardia.com/politica/elecciones-catalanas/20150928/54437688908/gobierno-debe-reaccionar-antes-elecciones-generales.html politica
#5                                              www.lavanguardia.com/economia/20150928/54436887975/audi-millones-coche-dieselgate.html economia
#6                                              www.lavanguardia.com/vida/20150928/54437702392/claves-sobreexponer-hijos-internet.html     vida
#7                                                     www.lavanguardia.com/ciencia/20150928/54437643626/superluna-roja-de-sangre.html  ciencia

As per your comment, the links in the file include the `http` prefix, so try:

> tmp <- read.csv("LaVanguardia_facebook_statuses.csv")
> sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])
#[1] "gente"         "series"        "deportes"      "television"    "vangdata"      "series"        "lacontra"      "participacion" "vida"          "comer"        
#[11] "vivo"          "local"         "politica"      "sucesos"       "natural"       "hemeroteca"    "natural"       "local"         "vida"          "fans"         
#[21] "television"    "viral"         "natural"       "deportes"      "vida"
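
To store the result as a new variable next to each link, as the question asks, something like the following might work. It assumes the links live in a column named `status_link` (as stated in the question). The small `df` is made-up sample data, and the column name `category` is only an illustrative choice.

```r
# Made-up sample; in practice df would come from read.csv(..., stringsAsFactors = FALSE)
df <- data.frame(
  status_link = c(
    "http://www.lavanguardia.com/vida/20150928/54437702392/claves-sobreexponer-hijos-internet.html",
    "www.lavanguardia.com/local/barcelona/20150927/54437681718/resultados-elecciones-barcelona.html",
    "https://videos.lavanguardia.com/politica/elecciones-catalanas/20150928/54437688908/gobierno-debe-reaccionar-antes-elecciones-generales.html"
  ),
  stringsAsFactors = FALSE
)

# Optional scheme, then the host, then capture the first path segment
pattern <- "^(?:https?://)?[^/]+/([^/]+).*$"
df$category <- sub(pattern, "\\1", df$status_link, perl = TRUE)

# Count how many articles fall into each category
print(table(df$category))
```

Referring to the column by name avoids depending on its position in the file. Note that `sub` returns a string unchanged when the pattern does not match, so rows that do not follow the structure would keep the full link in `category`.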

This answer works perfectly if I enter the data manually, but I can't get it to work on the dataframe column. I've tried all kinds of type conversion without success. I have uploaded the data here so you can work with it: https://www.dropbox.com/s/dot6iq9zhicxh1e/LaVanguardia_facebook_statuses.csv?dl=0 – adrian1121, Jun 12 '16 at 00:01
@adrian1121 the reason is that your links start with `http` etc. I am updating the answer. – rock321987, Jun 12 '16 at 02:06
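
A small sketch of why the first pattern fails on such links, under the assumption that they look like the hypothetical `link` below. After `http:` comes `//`, so the capture group finds no non-slash character to match, the whole pattern fails, and `sub` hands the string back unchanged.

```r
# Hypothetical link with a scheme prefix
link <- "http://www.lavanguardia.com/vida/20150928/54437702392/claves-sobreexponer-hijos-internet.html"

# First pattern: no match, so the input comes back untouched
old <- sub("^[^/]+/([^/]+).*$", "\\1", link)

# Updated pattern: the optional scheme is consumed before the host
new <- sub("^(?:https?://)?[^/]+/([^/]+).*$", "\\1", link, perl = TRUE)

print(identical(old, link))
print(new)
```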
