
I have these few lines of code:

QFile file("h:/test.txt");
file.open(QFile::ReadOnly | QFile::Text);
QTextStream in(&file);

bool found = false;
uint pos = 0;

do {
    QString temp = in.readLine();
    int p = temp.indexOf("something");
    if (p < 0) {
        pos += temp.length() + 1;
    } else {
        pos += p;
        found = true;
    }
} while (!found && !in.atEnd());

in.seek(0);
QString text = in.read(pos);
cout << text.toStdString() << endl;

The idea is to search a text file for a specific char sequence, and when found, load the file from the beginning to the occurrence of the searched text. The input I used for testing was:

this is line one, the first line
this is line two, it is second
this is the third line
and this is line 4
line 5 goes here
and finally, there is line number 6

Here is the strange part: if the searched string is on any line except the last, I get the expected behavior. It works perfectly fine.

BUT if I search for a string that is on the last line (line 6), the result is always 5 characters short. If it were on a 7th line, the result would be 6 characters short, and so on: when the searched string is on the last line, the result is always lineNumber - 1 characters too short.

So, is this a bug, or am I missing something obvious?

EDIT: Just to clarify, I am not asking for alternative ways to do this, I am asking why I get this behavior.

dtech
  • I would guess something to do with line endings, are you on a Windows platform? If so then your line endings may be two bytes each. – john Apr 06 '13 at 11:22
  • @john - no, I get expected result for previous lines, and every line has a `\n` - I should get problem for every line. If I adjust the compensation for it to 2, I get bad result for the previous lines. – dtech Apr 06 '13 at 11:24
  • Maybe you have a file with mixed line endings? In any case your approach is inherently risky. I don't know about QTextStream but in the equivalent standard C++ your code would not have well defined behaviour. I would just read the entire file into a string and manipulate it from there. – john Apr 06 '13 at 11:38
  • @john - that is what I typically do, but the requirement here is that the file might be pretty big and not entirely needed, that is why I want to find the "terminating" string and load only from the beginning to it. – dtech Apr 06 '13 at 11:45
  • BTW I also checked, the actual file size confirms the line ending is a single byte. – dtech Apr 06 '13 at 11:50
  • If the usual use case is that the string is found from the file OR if the files are not huge, it might be better to store the lines you have already read instead of reading them again. –  Apr 06 '13 at 13:15
  • @Roku - yes, but there is a requirement to minimize concatenation operations. And besides, reading the file again comes at no expense, at least in Windows, because the file is already cached in memory. – dtech Apr 06 '13 at 13:21
  • You could store the lines to QList and then print them without any concatenations. –  Apr 06 '13 at 14:03
  • @Roku - I did exactly this with a QStringList but still would like to investigate this matter. – dtech Apr 06 '13 at 14:04
  • Please, review my answer: http://stackoverflow.com/a/16100974/1035613 – Dmitry Sazonov Apr 20 '13 at 08:12

5 Answers


Obviously you get this behaviour because readLine() advances the stream cursor past the whole line, including its delimiter characters (LF, CRLF, or CR, depending on the file). The string you get back from this method does not contain those characters, so your position calculation does not account for them exactly.

The solution is to read not by lines but by buffer. Here is your code, modified:

QFile file("h:/test.txt");
file.open(QFile::ReadOnly | QFile::Text);
QTextStream in(&file);

bool found = false;
uint pos = 0;
qint64 buffSize = 64; // adjust to your needs

do {
    QString temp = in.read(buffSize);
    int p = temp.indexOf("something");
    if (p < 0) {
        uint posAdj = buffSize;
        if (temp.length() < buffSize)
            posAdj = temp.length();
        pos += posAdj;
    } else {
        pos += p;
        found = true;
    }
} while (!found && !in.atEnd());

in.seek(0);
QString text = in.read(pos);
cout << text.toStdString() << endl;

EDIT

The code above contains an error: the searched word might be split across two buffers. Here is a sample input that breaks it (assuming we search for keks):

test test test test test test
test test test test test test  keks
test test test test test test
test test test test test test
test test test test test test
test test test test test test
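
To make the failure concrete, here is a minimal sketch in plain standard C++ (no Qt; the names findInIsolatedChunks and findWithOverlap are made up for this illustration). The first function searches each chunk on its own and misses a match that straddles a chunk boundary. The second extends each search window by needle.size() - 1 characters, which is one common way to avoid that.

```cpp
#include <cstddef>
#include <string>

// Searches `text` in fixed-size chunks, looking at each chunk in isolation.
// Returns the offset of the first match it sees, or npos if it sees none.
std::size_t findInIsolatedChunks(const std::string& text,
                                 const std::string& needle,
                                 std::size_t chunkSize) {
    if (needle.empty() || chunkSize == 0)
        return std::string::npos;
    for (std::size_t start = 0; start < text.size(); start += chunkSize) {
        const std::string chunk = text.substr(start, chunkSize);
        const std::size_t p = chunk.find(needle);
        if (p != std::string::npos)
            return start + p;
    }
    return std::string::npos;
}

// Same scan, but each window also covers the first needle.size() - 1
// characters of the following chunk, so a match that starts in one chunk
// and ends in the next is still seen.
std::size_t findWithOverlap(const std::string& text,
                            const std::string& needle,
                            std::size_t chunkSize) {
    if (needle.empty() || chunkSize == 0)
        return std::string::npos;
    for (std::size_t start = 0; start < text.size(); start += chunkSize) {
        const std::string window =
            text.substr(start, chunkSize + needle.size() - 1);
        const std::size_t p = window.find(needle);
        if (p != std::string::npos)
            return start + p;
    }
    return std::string::npos;
}
```

With the text "aaaaaakeks" and a chunk size of 8, the first function reports no match, while the second finds the match at offset 6.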

Solution

Here is the complete code, which works with all inputs I tried:

#include <QFile>
#include <QTextStream>
#include <iostream>


int findPos(const QString& expr, QTextStream& stream) {
    if (expr.isEmpty())
        return -1;

    // a buffer as long as the searched expr is enough: a match can span
    // at most two consecutive buffers, which are searched together below
    const qint64 buffSize = expr.length();

    stream.seek(0);
    QString startBuffer = stream.read(buffSize);
    int pos = 0;

    while(!stream.atEnd()) {
        QString cycleBuffer = stream.read(buffSize);
        QString searchBuffer = startBuffer + cycleBuffer;
        int bufferPos = searchBuffer.indexOf(expr);
        if (bufferPos >= 0)
            return pos + bufferPos + expr.length();
        pos += cycleBuffer.length();
        startBuffer = cycleBuffer;
    }

    return pos;
}

int main(int argc, char *argv[])
{
    Q_UNUSED(argc);
    Q_UNUSED(argv);

    QFile file("test.txt");
    file.open(QFile::ReadOnly | QFile::Text);
    QTextStream in(&file);

    int pos = findPos("keks", in);

    in.seek(0);
    QString text = in.read(pos);
    std::cout << text.toUtf8().data() << std::endl;
}
dant3

When you search on the last line, you read the whole input stream, so in.atEnd() returns true. It looks like this somehow corrupts either the file or the text stream, or puts them out of sync, so that seek is no longer valid.

If you replace

in.seek(0);
QString text = in.read(pos);
cout << text.toStdString() << endl;

by

QString text;
if(in.atEnd())
{
    file.close();
    file.open(QFile::ReadOnly | QFile::Text);
    QTextStream in1(&file);
    text = in1.read(pos);
}

else
{
    in.seek(0);
    text = in.read(pos);
}
cout << text.toStdString().c_str() << endl;

it will work as expected. P.S. There might be a cleaner solution than re-opening the file, but the problem definitely comes from reaching the end of both the stream and the file and then trying to operate on them...
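
For comparison, plain standard C++ streams have a similar, well-documented pitfall: once a read loop has run into the end of the input, the stream's error flags stay set, and a later seekg() has no effect until clear() is called. The sketch below only illustrates that standard-library behaviour on an in-memory stream (rereadPrefix is a hypothetical helper). It is not a claim about what QTextStream does internally.

```cpp
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>

// Reads `contents` line by line until extraction fails, then tries to
// re-read the first n characters from the start of the stream.
// If clearFirst is false, the error flags left by the read loop are kept.
std::string rereadPrefix(const std::string& contents,
                         std::size_t n,
                         bool clearFirst) {
    std::istringstream in(contents);
    std::string line;
    while (std::getline(in, line)) {
        // consume everything; the loop ends with eofbit and failbit set
    }
    if (clearFirst)
        in.clear();  // reset the error flags so that seekg() can work
    in.seekg(0);

    std::string prefix(n, '\0');
    if (n > 0)
        in.read(&prefix[0], static_cast<std::streamsize>(n));
    // keep only what was actually read (nothing if the stream is failed)
    prefix.resize(static_cast<std::size_t>(in.gcount()));
    return prefix;
}
```

Calling rereadPrefix("one\ntwo\nthree", 3, false) yields an empty string, while passing true for clearFirst yields "one".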

Ilya Kobelevskiy

You know the difference between Windows and *nix line endings (\r\n vs \n). When you open a file in text mode, you should know that every \r\n sequence is translated to \n.

The mistake in your original code is that you try to calculate the offset of each skipped line without knowing the exact length of that line in the text file.

length = number_of_chars + number_of_eol_chars
where number_of_chars == QString::length()
and number_of_eol_chars == (1 if \n) or (2 if \r\n)

You cannot detect number_of_eol_chars without raw access to the file, and your code has no such access because it opens the file as text, not as binary. So the error in your code is that you hardcoded number_of_eol_chars as 1 instead of detecting it. For Windows text files (with \r\n EOL), pos will be off by one for each skipped line.
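
As a rough illustration of the formula above, here is a sketch in plain standard C++ (rawOffset is a hypothetical helper, and single-byte characters are assumed). It sums the stripped line lengths plus a given terminator width for every skipped line:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Offset in the raw file of column `column` on line `lineIndex` (both
// zero-based), given the lines with their terminators stripped and the
// number of characters each terminator occupies in the file (1 for \n,
// 2 for \r\n). Assumes one byte per character.
std::size_t rawOffset(const std::vector<std::string>& lines,
                      std::size_t lineIndex,
                      std::size_t column,
                      std::size_t eolWidth) {
    std::size_t offset = 0;
    for (std::size_t i = 0; i < lineIndex && i < lines.size(); ++i)
        offset += lines[i].size() + eolWidth;  // chars + eol chars
    return offset + column;
}
```

For the same target position, eolWidth = 2 gives a result that is larger than eolWidth = 1 by exactly the number of skipped lines, which is the per-line error described above.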

Fixed code:

#include <QFile>
#include <QTextStream>

#include <iostream>
#include <string>


int main(int argc, char *argv[])
{
    QFile f("test.txt");
    const bool isOpened = f.open( QFile::ReadOnly | QFile::Text );
    if ( !isOpened )
        return 1;
    QTextStream in( &f );

    const QString searchFor = "finally";

    bool found = false;
    qint64 pos = 0;

    do 
    {
        const qint64 lineStartPos = in.pos();
        const QString temp = in.readLine();
        const int ofs = temp.indexOf( searchFor );
        if ( ofs < 0 )
        {
            // Skip this line and advance pos by the exact length of the line.
            // Do not hardcode "1": the terminator may be 1 or 2 chars (\n or \r\n).
            const qint64 length = in.pos() - lineStartPos;
            pos += length;
        }
        else
        {
            pos += ofs;
            found = true;
        }

    } while ( !found && !in.atEnd() );

    in.seek( 0 );
    const QString text = in.read( pos );

    std::cout << text.toStdString() << std::endl;

    return 0;
}
Dmitry Sazonov
  • "For each line in windows text files (with \r\n eol) you will get mistake in pos for each skipped line" - but I don't get a mistake in position for each skipped line, I only get the wrong position when the searched text is on the last line, otherwise it is good. – dtech Apr 20 '13 at 11:39
  • Strange... Could you provide your original sample (source + text file, as .zip)? Because code that you posted is working as expected for me (as i described in my post - each line gives -1 position mistake), if text file uses \r\n eol. – Dmitry Sazonov Apr 21 '13 at 09:51

I'm not entirely sure why you're seeing this behavior but I'd suspect it's related to line endings. I tried your code and I only saw the last line behavior when the file had CRLF line endings AND there was no new line (CRLF) at the end of the file. So yes, weird. If the file had LF line endings then it always worked as expected.

With that said, it's probably not a good idea to keep track of the position by adding + 1 at the end of each line because you won't know if your source file was CRLF or LF and QTextStream will always strip the line endings. Here's a function that should work better. It builds up the output string line by line and I haven't seen any weird behavior with it:

void searchStream( QString fileName, QString searchStr )
{
    QFile file( fileName );
    if ( file.open(QFile::ReadOnly | QFile::Text) == false )
        return;

    QString text;
    QTextStream in(&file);
    QTextStream out(&text);

    bool found = false;

    do {
        QString temp = in.readLine();
        int p = temp.indexOf( searchStr );
        if (p < 0) {
            out << temp << endl;
        } else {
            found = true;
            out << temp.left(p);
        }
    } while (!found && !in.atEnd());

    std::cout << text.toStdString() << std::endl;
}

It doesn't keep track of the position in the original stream, so if you really wanted a position then I'd recommend using QTextStream::pos() as it will be accurate whether the file is CRLF or LF.
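
The same idea, asking the stream where it is instead of counting characters yourself, can be sketched with standard C++ streams, where tellg() plays a role comparable to QTextStream::pos(). This is an analogy under stated assumptions, not Qt code: offsetOfMatch and offsetInText are made-up names, and the stream is assumed to be an in-memory or binary-mode stream so that positions can be treated as plain offsets.

```cpp
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Returns the stream offset of the first occurrence of `needle`, or -1 if
// it is not found. The start of each line is taken from tellg() before the
// line is read, so the terminator width never has to be guessed.
std::streamoff offsetOfMatch(std::istream& in, const std::string& needle) {
    std::string line;
    for (;;) {
        const std::streamoff lineStart = in.tellg();
        if (!std::getline(in, line))
            return -1;  // no more lines
        const std::size_t p = line.find(needle);
        if (p != std::string::npos)
            return lineStart + static_cast<std::streamoff>(p);
    }
}

// Convenience wrapper for trying the function on an in-memory string.
std::streamoff offsetInText(const std::string& contents,
                            const std::string& needle) {
    std::istringstream in(contents);
    return offsetOfMatch(in, needle);
}
```

For example, offsetInText("ab\r\ncd\r\nneedle here", "needle") gives 8 and offsetInText("ab\ncd\nxx needle", "needle") gives 9, without the code knowing which terminator the input uses.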

Cutterpillow

The QTextStream::read() method takes as a parameter the maximum number of characters to read, not a file position. In many environments, the position is not a simple character count: VMS and Windows both come to mind as exceptions. VMS imposes a record structure which uses many hidden bits of metadata within the file, and file positions are "magic cookies".

The only filesystem-independent way to get the right value is to use QTextStream::pos() when the file is already positioned to the correct place, and then keep reading until the file position returns to the same location.

(Redacted because there was an initially unspecified requirement prohibiting multiple allocations to buffer the text.)
However, given the program's requirements, there is no sense in rereading the first part of the file. Start saving text at the beginning and stop when the string is found:

QString out;
do {
    QString temp = in.readLine();
    int p = temp.indexOf("something");
    if (p < 0) {
        out += temp;
        out += QLatin1Char('\n');  // readLine() strips the line terminator
    } else {
        out += temp.left(p);  // keep only the text before the match
        break;
    }
} while (!in.atEnd());

cout << out.toStdString() << endl;

Since you are on Windows, text file processing translates '\r\n' into '\n', and that causes a mismatch between file positioning and character counting. There are several ways to work around this, but perhaps the simplest is to process the file as binary (that is, drop the QFile::Text flag) to prevent the translation:

file.open(QFile::ReadOnly);

Then the code should work as expected. It doesn't do any harm to output \r\n on Windows, but it can sometimes cause nuisance displays when using Windows' text utilities. If that is important, search and replace \r\n with \n once the text is in memory.
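
A minimal sketch of that last step, written in plain standard C++ rather than Qt (normalizeNewlines is a hypothetical name): it drops the \r of every \r\n pair and leaves a lone \r untouched.

```cpp
#include <cstddef>
#include <string>

// Returns a copy of `raw` in which every "\r\n" pair is replaced by "\n".
// A '\r' that is not followed by '\n' is kept as it is.
std::string normalizeNewlines(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '\r' && i + 1 < raw.size() && raw[i + 1] == '\n')
            continue;  // skip the CR; the LF is copied on the next pass
        out.push_back(raw[i]);
    }
    return out;
}
```

For example, normalizeNewlines("a\r\nb\rc\n") returns "a\nb\rc\n".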

wallyk
  • The requirement was to avoid reallocation. Those concatenation operations may become increasingly heavy for big input files. Storing and summing all the lines at once is not desired either. – dtech Apr 15 '13 at 00:25
  • @ddriver: Can you be assured that the program will not be required to run on an opaque filesystem? That is, will it run only on Linux, Unix, etc.? – wallyk Apr 15 '13 at 03:19
  • The issue of different formats of representing text on different platforms is recognized. It is not what the question is about. My text input is exactly the number of characters + 1 byte for each EOL. I want to know why I get the expected behavior when searching on all lines except the last one. – dtech Apr 15 '13 at 03:36
  • @ddriver: What platform is it running on? – wallyk Apr 15 '13 at 03:40
  • @ddriver: I have amended my answer accordingly. – wallyk Apr 15 '13 at 04:35