I have a corpus of many documents, each containing long text. I want to tokenize this corpus for further analysis. However, the texts contain irrelevant data within parentheses (typically references, such as "(example example)"), so I want to delete them. I have found methods on Stack Overflow for doing this on plain text objects, but I don't know how to apply them to a corpus (would the words between the parentheses already be treated as independent tokens, so that the regex no longer removes them?). I've figured out that I should do this before removing punctuation, since that step also removes the parentheses.
Could you help me with this? Thank you in advance!
So far I have only got as far as this regex: "\(.\)"
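To make the goal concrete, here is a minimal sketch of the behaviour I am after, written in Python only for illustration. It assumes the corpus is simply a list of strings; `corpus`, `strip_parentheticals` and `PAREN_RE` are made-up names, not part of any corpus library. Note that "\(.\)" matches only a single character between the parentheses, so the sketch uses "\([^()]*\)" instead, which handles non-nested parentheses only.

```python
import re

# "(" followed by any run of non-parenthesis characters, then ")".
# Leading whitespace is consumed too, so no double spaces are left behind.
# Assumption: parentheses are not nested.
PAREN_RE = re.compile(r"\s*\([^()]*\)")


def strip_parentheticals(text):
    """Return text with every non-nested (...) span removed."""
    return PAREN_RE.sub("", text)


# Hypothetical stand-in for the real corpus: one string per document.
corpus = [
    "Tokenization is a first step (Smith 2010) in most pipelines.",
    "Some texts (example example) contain references.",
]

# Clean each document first, while the parentheses are still present...
cleaned = [strip_parentheticals(doc) for doc in corpus]

# ...and only then tokenize (here: a naive lowercase word split).
tokens = [re.findall(r"\w+", doc.lower()) for doc in cleaned]
```

The point of the ordering is that the parenthesized spans are removed from the raw strings before any tokenization or punctuation stripping takes place.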