
I have a big text file (2 GB) that contains a couple of books. I want to create a `char**` that holds each word of the whole file. To do that, I first read the entire file into one huge string and then build the `char**` from it.

The problem is that the getline() loop takes far too long (hours) to finish. I ran it for 30 minutes and the program had read only 500,000 lines. The whole file is 43,000,000 lines.

int main (){
ifstream book;
string sbook,str;
book.open("gutenberg.txt"); // the huge file
cout<<"Reading the file ....."<<endl;
while(!book.eof()){
    getline(book,sbook);//passing the line as a string to sbook
    if(str.empty()){
        str= sbook;
    }
    else
        str= str + " " + sbook;//append sbook to the accumulated string until the end of the file

}//I never managed to get out of this loop
cout<<"Done reading the file."<<endl;
cout<<"Removal....."<<endl;
removal(str);//removes all punctuation and converts each uppercase letter to lowercase
cout<<"done removal"<<endl;
cout<<"Removing doublewhitespaces...."<<endl;
int whitespaces=removedoublewhitespace(str);//removes excess whitespace, leaving a single space between words,
                                            //and returns the number of whitespace characters left
cout<<"doublewhitespaces removed."<<endl;
cout<<"initiating leksis....."<<endl;
char **leksis=new char*[whitespaces+1];//whitespaces+1 is how many words are left in the file
for(int i=0;i<whitespaces+1;i++){
    leksis[i]= new char[30];
}
cout<<"done initiating leksis."<<endl;
int y=0,j=0;
cout<<"constructing leksis,finding plithos...."<<endl;
for(int i=0;i<str.length();i++){
    if(isspace(str[i])){;
        y++;
        j=0;
        leksis[y][j]=' ';
        j++;
    }
    else{
        leksis[y][j]=str[i];
        j++;
    } 
}
cout<<"Done constructing leksis,finding plithos...."<<endl;

removal() function:

void removal(string &s) {
for (int i = 0, len = s.size(); i < len; i++)
{
    char c=s[i];
    if(isupper(s[i])){
        s[i]=tolower(s[i]);
    }
    int flag=ispunct(s[i]);
    if (flag){
        s.erase(i--, 1);
        len = s.size();
    }
}

}

removedoublewhitespace() function:

int removedoublewhitespace(string &str){
int wcnt=0;
for(int i=str.size()-1; i >= 0; i-- )
{
    if(str[i]==' '&&str[i]==str[i-1]) //added equal sign
    {
        str.erase( str.begin() + i );
    }
}
for(int i=0;i<str.size();i++){
    if(isspace(str[i])){
        wcnt++;
    }
}
return wcnt;

}

  • 1
    make sure you are doing an optimized build. – pm100 Jun 09 '22 at 18:12
  • 2
    `while(!book.eof()){` can cause you problems unrelated to performance: [https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-i-e-while-stream-eof-cons](https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-i-e-while-stream-eof-cons) – drescherjm Jun 09 '22 at 18:13
  • 4
    Consider [copying the file directly to your string](https://stackoverflow.com/questions/2602013/read-whole-ascii-file-into-c-stdstring) instead of appending to an existing string 43 million times. Then just replace newline characters with space characters. – Drew Dormann Jun 09 '22 at 18:16
  • 2
    Do you really need the entire 2GB in memory at once? FYI, if your program goes beyond the memory size allocated by the OS, the OS will page your data onto the hard drive. Please consider using memory-mapped files or processing the files in blocks (chunks). – Thomas Matthews Jun 09 '22 at 18:32
  • 3
    I'd use *file mapping* and `std::string_view` – MatG Jun 09 '22 at 18:35
  • If you must have all the data in memory at once, consider allocating an array in dynamic memory and using the block read method. There is an overhead for every I/O transaction; so transfer as much as you can per transaction. – Thomas Matthews Jun 09 '22 at 18:35
  • When you say you need a `(char **)` that contains each word of the file, do you mean a single copy of each unique word in the file, or a list of words in the file, in order? (e.g. if your file contains the 25,000 instances of the word "the", will you want to have 25,000 copies of the word "the" in memory or just one?) – Jeremy Friesner Jun 09 '22 at 18:40
  • @OP `void removal(string &s) { s.erase(std::remove_if(s.begin(), s.end(), [](unsigned char c){ return isspace(c) || ispunct(c);}), s.end()); }` -- That one line of code literally describes what is being done, unlike your set of `for` loops, where someone has to figure out what is being done, as well as test to see if it actually does the job. – PaulMcKenzie Jun 09 '22 at 18:59
  • Also, you should be aware of calling `erase` so many times while looping. You should structure your code so as to call erase a minimum number of times (like my above comment, where `erase` is called once). This is especially true when you're not erasing at the end of the string. All of the movement of the data for the string has to be done each and every time erase is called. – PaulMcKenzie Jun 09 '22 at 19:05
  • Also, your `int removedoublewhitespace(string &str){` has a bug if `str` is empty. Also, I didn't check whether your code works if there are more than 2 consecutive spaces. Maybe it works, maybe it doesn't; you need to verify this. And to my last point -- maybe it would be faster to simply build a new string that doesn't have double spaces, instead (and again) of `erase`-ing so many times in a loop. Then assign the non-double-spaced string to `str`. – PaulMcKenzie Jun 09 '22 at 19:08
  • [See this](https://godbolt.org/z/cE4WGT1zP). No calls to `erase` in the middle of the loop. – PaulMcKenzie Jun 09 '22 at 19:22
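
The advice in the comments above (avoid calling `erase` inside a loop, and build the cleaned text in one go) could be sketched roughly as follows. This is only an illustration, not the code behind the godbolt link: `normalize` is a hypothetical name, and the sketch assumes it is acceptable to merge both cleanup steps into a single pass. Unlike the original `removal()` and `removedoublewhitespace()`, it also trims leading and trailing whitespace, treats tabs and newlines as separators, and returns the number of words rather than the number of spaces.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): lowercases letters,
// drops punctuation, and collapses every run of whitespace into one space.
// It rewrites s in place with a read index and a write index, so no
// erase() call ever has to shift the rest of the string.
// Returns the number of words left in s.
std::size_t normalize(std::string& s) {
    std::size_t out = 0;      // next position to write to (always <= in)
    std::size_t words = 0;    // number of words seen so far
    bool in_word = false;     // true while we are inside a word

    for (std::size_t in = 0; in < s.size(); ++in) {
        const unsigned char c = static_cast<unsigned char>(s[in]);
        if (std::ispunct(c)) {
            continue;                     // drop punctuation entirely
        }
        if (std::isspace(c)) {
            in_word = false;              // remember the gap, write nothing yet
            continue;
        }
        if (!in_word) {
            if (out > 0) {
                s[out++] = ' ';           // exactly one separator between words
            }
            in_word = true;
            ++words;
        }
        s[out++] = static_cast<char>(std::tolower(c));
    }
    s.resize(out);                        // cut off the leftover tail once
    return words;
}
```

Called as `std::size_t count = normalize(str);`, this touches each character once, so the cost grows linearly with the size of the text instead of quadratically.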

1 Answer


This loop

while(!book.eof()){
    getline(book,sbook);//passing the line as a string to sbook
    if(str.empty()){
        str= sbook;
    }
    else
        str= str + " " + sbook;
}

is hugely inefficient. Every `str = str + " " + sbook` builds a new temporary string that contains everything read so far and then assigns it back, so the total work grows quadratically with the size of the file. If you must have the whole file in memory at once, put it in a linked list of strings, one per line, or in a vector of strings. That is still a huge chunk of memory, but it will be allocated far more efficiently.
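
A minimal sketch of the vector-of-strings idea, assuming one `std::string` per line is acceptable. The function name `read_lines` is made up for this illustration. It also uses the result of `std::getline` as the loop condition instead of testing `eof()`.

```cpp
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (name not from the original post): reads a text file
// into a vector with one string per line. Returns an empty vector if the
// file cannot be opened.
std::vector<std::string> read_lines(const std::string& path) {
    std::ifstream book(path);
    std::vector<std::string> lines;
    std::string line;

    // getline() returns the stream, which converts to false once a read
    // fails, so the loop stops cleanly at the end of the file.
    while (std::getline(book, line)) {
        // Moving avoids copying the line; getline() clears the string
        // before the next read, so reusing it afterwards is fine.
        lines.push_back(std::move(line));
    }
    return lines;
}
```

For example, `auto lines = read_lines("gutenberg.txt");` loads the file without ever copying the text that was already read.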

pm100
  • 1
    also that eof condition is wrong https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-i-e-while-stream-eof-cons – pm100 Jun 09 '22 at 18:17
  • 3
    Even just using `reserve` and `+=` will probably make a *massive* difference. – David Schwartz Jun 09 '22 at 18:22
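
If a single big string is really required, the `reserve` and `+=` suggestion might look roughly like the sketch below. The name `read_joined` is hypothetical, and the file size from `tellg()` is used only as a capacity hint, which is an assumption of this sketch rather than something stated in the thread.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper (name not from the original post): reads a text file
// into one string, joining the lines with a single space, as the loop in
// the question does.
std::string read_joined(const std::string& path) {
    std::ifstream book(path);
    std::string str;
    if (!book) {
        return str;               // file could not be opened
    }

    // Ask for the file size and reserve that much once, so the string does
    // not have to reallocate over and over while it grows.
    book.seekg(0, std::ios::end);
    const std::streamoff size = book.tellg();
    book.seekg(0, std::ios::beg);
    if (size > 0) {
        str.reserve(static_cast<std::size_t>(size));
    }

    std::string line;
    while (std::getline(book, line)) {   // stops cleanly at end of file
        if (!str.empty()) {
            str += ' ';           // appends in place, no temporary copy
        }
        str += line;
    }
    return str;
}
```

Because `+=` appends to the existing buffer, each line is copied once, whereas `str = str + " " + sbook` copies everything read so far on every iteration.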