I have a corpus of many documents containing long texts. I want to tokenize this corpus for further analysis. However, the texts contain irrelevant data in parentheses (typically references, such as "(example example)"), so I want to delete them. I have found methods on Stack Overflow that work on text objects, but I don't know how to apply them to a corpus (would the words between the parentheses be treated as independent tokens and not removed by the regex?). I've figured out that I should do this before removing punctuation, as that step also removes the parentheses.

Could you help me with this? Thank you in advance!

So far I have only come up with this regex: "\(.\)"

  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Jul 21 '21 at 19:12
  • Thank you, I'm not really experienced with R, I try to create one – armentieres Jul 21 '21 at 19:15

1 Answer

You can remove all text in brackets using gsub(). Since you plan to remove punctuation in a later step, you can replace each match with ".", just to mark where something was taken out (useful if you need to debug the pipeline), or you can replace it with the empty string "".

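Your regex would not work as written: the . matches exactly one character, so the pattern only matches brackets that contain a single character. In an R string you also need to escape the brackets with double backslashes. Finally, you want to match multiple characters, but as few as possible, so that each match ends at the nearest closing bracket. For that you need the lazy quantifier .*? for the contents of the brackets: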

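# Two example texts containing bracketed remarks
corp = c("This is an example (or demonstration) of replacing things in brackets",
         "Just use gsub (a function in base) to remove (or better replace) these elements")

# Replace each bracketed span (lazy match, so it stops at the nearest ")") with "."
corp = gsub("\\(.*?\\)", ".", corp)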

The example above would result in the vector:

> corp
[1] "This is an example . of replacing things in brackets"
[2] "Just use gsub . to remove . these elements"
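
To see why the lazy quantifier matters, here is a small sketch that compares a greedy pattern, the lazy pattern from above, and a negated character class on the second example text. It also shows the empty-string replacement followed by a whitespace cleanup. The variable names (x, greedy, lazy, negated, clean) are only illustrative, and the whitespace cleanup is an extra step that may or may not be needed in your pipeline.

```r
x <- "Just use gsub (a function in base) to remove (or better replace) these elements"

# Greedy: .* runs from the first "(" to the LAST ")", swallowing the text in between
greedy <- gsub("\\(.*\\)", "", x)

# Lazy: .*? stops at the first ")" after each "(", so each bracketed span is removed separately
lazy <- gsub("\\(.*?\\)", "", x)

# Alternative: match anything that is not a closing bracket
negated <- gsub("\\([^)]*\\)", "", x)

# Removing a span leaves two adjacent spaces; collapse them and trim the ends
clean <- trimws(gsub(" +", " ", lazy))

print(greedy)
print(lazy)
print(clean)
```

Note that none of these patterns handles nested brackets such as "(a (b) c)"; if your references can nest, a single regex like this will leave stray fragments behind.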

Depending on the package you use for your corpus, you can do this with the character vector before converting it to a corpus or you can use specific mapping functions (e.g. tm_map() in tm) to apply it to all texts.
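
As a rough sketch of both routes: the helper strip_brackets() and the sample texts below are made up for illustration and are not part of any package. The tm part only runs if tm is installed, and it assumes you build the corpus with VCorpus() and VectorSource(); adapt it to however your corpus is actually created.

```r
# Hypothetical helper: remove bracketed spans, then tidy up leftover whitespace
strip_brackets <- function(x) {
  x <- gsub("\\(.*?\\)", "", x)
  trimws(gsub(" +", " ", x))
}

# Made-up sample texts standing in for your documents
texts <- c("First document (Smith 2020) with a reference.",
           "Second document (see Jones 2019) and more (ibid.) text.")

# Route 1: clean the character vector before building the corpus
cleaned <- strip_brackets(texts)
print(cleaned)

# Route 2: apply the same function to an existing tm corpus
# (skipped if the tm package is not installed)
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(texts))
  # content_transformer() wraps a plain function so tm_map() can apply it to each document
  corpus <- tm::tm_map(corpus, tm::content_transformer(strip_brackets))
}
```

Either way, run this step before any punctuation removal, since that step would delete the brackets and leave their contents behind as ordinary tokens.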

Martin Wettstein
  • Thank you very much Martin! Your proposed solution seems to be working and this was a great help for me! I cannot upvote you yet as a new user, but I would :) – armentieres Jul 22 '21 at 19:33