I have a dataframe, called df, of articles from a major Spanish newspaper, La Vanguardia. I want to extract the category each article belongs to. Looking carefully at the links provided, I have realized that there is a recurring pattern. Here is an example:
status_link
www.lavanguardia.com/de-moda/moda/infantil/20150928/54437701110/freya-fossaceco-it-girl-8-meses.html
www.lavanguardia.com/local/barcelona/20150927/54437681718/resultados-elecciones-barcelona.html
www.lavanguardia.com/ciencia/20150928/54436850805/metastasis-cerebro-perfil-genetico-tratamiento-vall-d-hebron.html
videos.lavanguardia.com/politica/elecciones-catalanas/20150928/54437688908/gobierno-debe-reaccionar-antes-elecciones-generales.html
www.lavanguardia.com/economia/20150928/54436887975/audi-millones-coche-dieselgate.html
www.lavanguardia.com/vida/20150928/54437702392/claves-sobreexponer-hijos-internet.html
www.lavanguardia.com/ciencia/20150928/54437643626/superluna-roja-de-sangre.html
Notice that the first word after the first slash corresponds to the category (in Spanish): moda (fashion), local, ciencia (science), economia (economics), vida (life), etc.
www.lavanguardia.com/de-moda/moda/infantil/20150928/54437701110/freya-fossaceco-it-girl-8-meses.html
www.lavanguardia.com/local/barcelona/20150927/54437681718/resultados-elecciones-barcelona.html
I'd like to extract each of these words and create a new variable next to each link, assigning each link its corresponding category. I don't know how many categories there might be, but the pattern is consistent. Notice also that the domain sometimes varies (see the 4th link in the example above), but the link still keeps the same structure.
I have tried everything suggested here, here and here, but it's not clear to me how to apply the regex package (or any other that might be useful) in this case.
I'd really appreciate any suggestions!
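To make the goal concrete, here is a rough sketch of the rule I have in mind, written in plain Python with only the standard `re` module. The helper name `extract_category` and the pattern are my own assumptions for illustration, not something taken from an existing package. It assumes every link has the form host/category/..., with an optional http(s):// prefix.

```python
import re

# Assumed pattern: optional scheme, then the host (everything up to the
# first slash), then the first path segment, which is captured.
CATEGORY_RE = re.compile(r"^(?:https?://)?[^/]+/([^/]+)/")

def extract_category(link):
    """Return the first path segment of a link, or None if there is none."""
    match = CATEGORY_RE.match(link)
    return match.group(1) if match else None

# A few of the links from the example above.
links = [
    "www.lavanguardia.com/de-moda/moda/infantil/20150928/54437701110/freya-fossaceco-it-girl-8-meses.html",
    "www.lavanguardia.com/local/barcelona/20150927/54437681718/resultados-elecciones-barcelona.html",
    "www.lavanguardia.com/ciencia/20150928/54436850805/metastasis-cerebro-perfil-genetico-tratamiento-vall-d-hebron.html",
    "videos.lavanguardia.com/politica/elecciones-catalanas/20150928/54437688908/gobierno-debe-reaccionar-antes-elecciones-generales.html",
]

# One category per link, in the same order as the links.
categories = [extract_category(link) for link in links]
```

If df were a pandas DataFrame, I assume the same function could be applied to the column with something like `df["status_link"].map(extract_category)`, but I have not verified this on the full data.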
EDIT
I have uploaded the data here so you can work with it. The column I am interested in is status_link.
https://www.dropbox.com/s/dot6iq9zhicxh1e/LaVanguardia_facebook_statuses.csv?dl=0