Is there an easier way to simply remove or filter out all non-alphabetical characters in C++?

I am loading a file and sorting each word into a dictionary. I want the dictionary to contain only single whole words, with no spaces and no non-alphabetical characters.

//Read the entire file (stream) into QString variable "file"
            QString file = in.readAll();
            QStringList NewList = file.split(QRegExp("[\\s\\,\\!\\?\\...\\;\\:\\-\\[\\]\\{\\}\\+\\-\\=\\_\\<\\>]"), QString::SkipEmptyParts);

This method does work, but listing every non-alphabetical character by hand is tedious and inefficient.

Can somebody show me a quicker method for doing this?

I am certain this is not the best way...

user_1

2 Answers

Using a regular expression is the right way, but use it to find the words rather than the places to split. Your code then becomes more expressive and less error-prone. Further, use Qt5's new QRegularExpression class because of its better performance.

As for the regular expression: consult any tutorial and read about the meaning of \w and \b. Here is an example of where this is going (\b is not needed, but I included it for demonstration purposes):

QString data = "Lorem ipsum dolor sit amet, consetetur - sadipscing - elitr. Stet clita kasd gubergren!";

QRegularExpression rx("\\b(\\w+)\\b");
QRegularExpressionMatchIterator matches = rx.globalMatch(data);
while (matches.hasNext()) {
    QRegularExpressionMatch match = matches.next();
    qDebug() << match.captured(1);
}
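
Note that \w also matches digits and the underscore, so if the dictionary should hold strictly alphabetical words, a narrower character class may be needed. Below is a minimal sketch of the same "match the words" idea in standard C++ (std::regex swapped in for the Qt classes so that it compiles without Qt). The function name extractWords and the ASCII-only class [A-Za-z]+ are assumptions for illustration, not part of the original code.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every maximal run of ASCII letters in `text`.
// Digits, underscores, punctuation and whitespace all act as separators.
std::vector<std::string> extractWords(const std::string& text) {
    // Assumption: "alphabetical" means A-Z and a-z only (no accented letters).
    static const std::regex wordPattern("[A-Za-z]+");

    std::vector<std::string> words;
    // Iterate over the matches instead of splitting on separators.
    for (std::sregex_iterator it(text.begin(), text.end(), wordPattern), end;
         it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}
```

Calling extractWords("Lorem ipsum, dolor_sit 42 amet!") should give the five words Lorem, ipsum, dolor, sit and amet. With Qt, the same character class could presumably be passed to QRegularExpression in place of \w.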
Lorenz

For your specific case, I would first find out whether the file has a predetermined format (e.g. a delimiter, one word per line, etc.) rather than pulling out characters with a regex, which will probably be less efficient.

But a simpler form of your RegExp would likely be:

QStringList NewList = file.split(QRegExp("\\W"), QString::SkipEmptyParts);

Although this doesn't include things like apostrophe and accents.
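
If the regex turns out to be a bottleneck, a single pass over the text is one possible alternative. The sketch below uses plain standard C++ rather than Qt and collects runs of letters with std::isalpha, optionally keeping an apostrophe that sits between two letters. The name alphaWords and the keepApostrophes flag are made up for illustration, and in the default "C" locale std::isalpha only recognises ASCII letters, so accented characters would still be treated as separators.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: collects maximal runs of letters in one pass.
// If keepApostrophes is true, an apostrophe between two letters stays in the word.
std::vector<std::string> alphaWords(const std::string& text,
                                    bool keepApostrophes = false) {
    std::vector<std::string> words;
    std::string current;

    for (std::size_t i = 0; i < text.size(); ++i) {
        // Cast to unsigned char: passing a negative char to isalpha is undefined.
        const unsigned char c = static_cast<unsigned char>(text[i]);
        const bool letter = std::isalpha(c) != 0;
        const bool innerApostrophe =
            keepApostrophes && c == '\'' && !current.empty() &&
            i + 1 < text.size() &&
            std::isalpha(static_cast<unsigned char>(text[i + 1])) != 0;

        if (letter || innerApostrophe) {
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            // Any other character ends the current word.
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        words.push_back(current);  // flush the last word
    }
    return words;
}
```

With the default flag, "don't stop" is cut into don, t and stop, which mirrors what splitting on \W does. With keepApostrophes set to true it yields don't and stop instead.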

Nick D
  • \W (a capital W) is short for [^\w], which matches every "non-word" character, i.e. anything other than [A-Za-z0-9_]. So it will split on spaces as well. – Nick D Apr 30 '17 at 12:44