
I want to extract the letters and digits that immediately follow the exact string "data-permalink=" in a huge text file (50MB). Ideally the output would be written to a separate, simple text file looking something like this:

34k89 456ij 233a4 ...

The "data-permalink="" part always stays exactly the same (as usual in source code), but the id inside the quotes can be any combination of letters and digits. It seemed simple at first, but because the marker is not at the start of a line and the value I need is not a separate word, I was not able to come up with a working solution in the time I had. I am running out of time and need a solution or hints right away, so any help is greatly appreciated.

Example of data in the source file:

random stuff above ....

I understand C++ and Python best, so a solution in one of these languages would be nice.

I tried something like this:

#include <iostream>
#include <string>
#include <fstream>
using namespace std;

int main()
{
    ifstream in ("data.txt");
    if(in.fail())
    {
        cout<<"error";
    }
    else
    {
        char c;
        while(in.get(c))
        {
            if(c=="data-permalink=")
                cout<<"lol this is awesome"
            else
                cout<<" ";
        }
    }
    return 0;
}

It is just a random attempt to see if the structure works, nowhere near a solution. This prob. also gives u guys a good guess on how bad i am currently lmao.

user4581301
Sunlaser
  • Based on this post, it appears you are in need of a [good C++ book](https://stackoverflow.com/a/388282/4641116). (I did not downvote.) – Eljay Apr 01 '22 at 16:38
  • "Any of languages x,y ... will do": This kind of request almost certainly indicates a question that isn't narrowed down enough. (I did downvote.) As for posting code/an example of an input file preserving line breaks: use code blocks for this purpose. As for the question: read the data to a buffer, search for the substring, read the next buffer, check in the overlap, check in the new buffer, rinse and repeat... – fabian Apr 01 '22 at 16:42
  • When you have to do the amount of work you had to do to format this post you should strongly consider stopping and reading some instructions on how to use the site. If the folks behind Stack Overflow were really that stupid, no one would use the site, so there has got to be an easier way. Take the [tour], read [ask], and look around the question asking page for helpful links and formatting tips. – user4581301 Apr 01 '22 at 16:43
  • You are comparing a char to a string, that's not what you should do here. For the most basic scenario, I would suggest you to use std::getline instead (https://en.cppreference.com/w/cpp/string/basic_string/getline), and after that, you have to search for the text in the string you get from std::getline with, for example, std::string::find (https://en.cppreference.com/w/cpp/string/basic_string/find). My advice to you would be to first try to read the whole text file and print it to console via std::cout, in order to understand better what's going on. – kamshi Apr 01 '22 at 18:13
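The buffered approach suggested in the comments (read a chunk, search it, carry the overlap into the next chunk) could look roughly like the sketch below, assuming C++17. The function names `extractPermalinksChunked` and `extractPermalinksFromText` and the `chunkSize` parameter are made up for illustration and do not come from the thread. The second function only exists to try the first one on an in-memory string.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads the stream in chunks of chunkSize bytes and
// collects every value between data-permalink=" and the next double quote.
std::vector<std::string> extractPermalinksChunked(std::istream& in, std::size_t chunkSize) {
    assert(chunkSize > 0);
    const std::string marker{ "data-permalink=\"" };
    const std::size_t overlap = marker.size() - 1;  // longest possible partial marker

    std::vector<std::string> links;
    std::string buffer;                  // carried-over tail plus the newest chunk
    std::string chunk(chunkSize, '\0');

    while (in.read(chunk.data(), static_cast<std::streamsize>(chunk.size())), in.gcount() > 0) {
        buffer.append(chunk.data(), static_cast<std::size_t>(in.gcount()));

        std::size_t pos = 0;                       // everything before pos is finished
        std::size_t keepFrom = std::string::npos;  // start of an unfinished match, if any
        for (;;) {
            const std::size_t found = buffer.find(marker, pos);
            if (found == std::string::npos) break;
            const std::size_t valueBegin = found + marker.size();
            const std::size_t valueEnd = buffer.find('"', valueBegin);
            if (valueEnd == std::string::npos) {
                keepFrom = found;  // closing quote not read yet: keep the whole match
                break;
            }
            links.emplace_back(buffer, valueBegin, valueEnd - valueBegin);
            pos = valueEnd + 1;
        }

        if (keepFrom == std::string::npos) {
            // No unfinished match: keep only a tail that could be the start of a marker,
            // but never re-scan text that was already consumed.
            const std::size_t tailStart = buffer.size() > overlap ? buffer.size() - overlap : 0;
            keepFrom = std::max(pos, tailStart);
        }
        buffer.erase(0, keepFrom);
    }
    return links;
}

// Hypothetical convenience wrapper for trying the function on a string.
std::vector<std::string> extractPermalinksFromText(const std::string& text, std::size_t chunkSize) {
    std::istringstream in{ text };
    return extractPermalinksChunked(in, chunkSize);
}
```

For a real file you would pass a `std::ifstream` and a chunk size of, say, a few megabytes. Note that an attribute whose closing quote never arrives makes the buffer grow until the end of the file, so for a 50MB input this saves memory only when the data is well formed.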

1 Answer


Hm, basically 50MB is considered "small" nowadays. With that little data, you can read the whole file into one std::string and then do a linear search.

So, the algorithm is:

  1. Open the files and check whether they could be opened
  2. Read the complete file into a std::string
  3. Do a linear search for the string "data-permalink=""
  4. Remember the start position of the permalink
  5. Search for the closing "
  6. Use std::string's substr function to create the output permalink string
  7. Write this to a file
  8. Go back to step 3 and continue the search after the closing ", until nothing more is found
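Steps 3 to 8 can be sketched as one small function that works on a string already held in memory, assuming C++17. The name `extractPermalinks` is hypothetical, and the function returns the links in a vector instead of writing them to a file so that the search loop can be tried in isolation.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: collects every value that follows data-permalink="
// up to the next double quote, in the order in which the values appear.
std::vector<std::string> extractPermalinks(std::string_view data) {
    constexpr std::string_view marker{ "data-permalink=\"" };
    std::vector<std::string> links;

    std::size_t pos = 0;
    // Step 3: linear search for the marker, starting after the previous match
    while ((pos = data.find(marker, pos)) != std::string_view::npos) {
        // Step 4: the value starts right after the marker
        const std::size_t valueBegin = pos + marker.size();
        // Step 5: search for the closing quote
        const std::size_t valueEnd = data.find('"', valueBegin);
        if (valueEnd == std::string_view::npos)
            break;  // unterminated attribute: nothing more to extract
        // Steps 6 and 7: cut out the value and store it
        links.emplace_back(data.substr(valueBegin, valueEnd - valueBegin));
        // Step 8: continue behind the closing quote
        pos = valueEnd + 1;
    }
    return links;
}
```

Calling `extractPermalinks` on the contents of the file and writing each element followed by a newline gives the same output format as described in the question.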

I created a 70MB random test file with random data.

The whole procedure takes less than 1s. Even with slow linear search.

One caveat: you want to parse an HTML file. For real HTML, a plain text search like this will most probably not be robust, because of potential nested structures. For that you should use an existing HTML parser.

Anyway. Here is one of many possible solutions.

#include <iostream>
#include <fstream>
#include <string>
#include <random>
#include <iterator>
#include <algorithm>

std::string randomSourceCharacters{ " abcdefghijklmnopqrstuvwxyz" };
const std::string sourceFileName{ "r:\\test.txt" };
const std::string linkFileName{ "r:\\links.txt" };

// Creates a large test file of random characters with embedded permalink attributes
void createRandomData() {
    std::random_device randomDevice;
    std::mt19937 randomGenerator(randomDevice());
    // Index into randomSourceCharacters; the cast avoids a narrowing conversion from size_t
    std::uniform_int_distribution<> randomCharacterDistribution(0, static_cast<int>(randomSourceCharacters.size()) - 1);
    std::uniform_int_distribution<> randomLength(10, 30);

    if (std::ofstream ofs{ sourceFileName }; ofs) {

        // Each record: random prefix, data-permalink="<random id>", random suffix
        for (size_t i{}; i < 1000000; ++i) {

            const int prefixLength{ randomLength(randomGenerator) };
            const int linkLength{ randomLength(randomGenerator) };
            const int suffixLength{ randomLength(randomGenerator) };

            for (int k{}; k < prefixLength; ++k)
                ofs << randomSourceCharacters[randomCharacterDistribution(randomGenerator)];
            ofs << "data-permalink=\"";

            for (int k{}; k < linkLength; ++k)
                ofs << randomSourceCharacters[randomCharacterDistribution(randomGenerator)];
            ofs << "\"";
            for (int k{}; k < suffixLength; ++k)
                ofs << randomSourceCharacters[randomCharacterDistribution(randomGenerator)];

        }
    }
    else std::cerr << "\nError: Could not open source file '" << sourceFileName << "' for writing\n";
}


int main() {
    // Please uncomment if you want to create a file with test data
    // createRandomData();


    // Open source file for reading and check, if file could be opened
    if (std::ifstream ifs{ sourceFileName }; ifs) {

        // Open link file for writing and check, if file could be opened
        if (std::ofstream ofs{ linkFileName }; ofs) {

            // Read the complete 50MB file into a string
            std::string data(std::istreambuf_iterator<char>(ifs), {});

            const std::string searchString{ "data-permalink=\"" };
            const std::string permalinkEndString{ "\"" };

            // Do a linear search
            for (size_t posBegin{}; posBegin < data.length(); ) {

                // Search for the begin of the permalink
                if (posBegin = data.find(searchString, posBegin); posBegin != std::string::npos) {

                    const size_t posStartForEndSearch = posBegin + searchString.length() ;

                    // Search for the end of the permalink
                    if (size_t posEnd = data.find(permalinkEndString, posStartForEndSearch); posEnd != std::string::npos) {

                        // Output result
                        const size_t lengthPermalink{ posEnd - posStartForEndSearch };
                        const std::string output{ data.substr(posStartForEndSearch, lengthPermalink) };
                        ofs << output << '\n';
                        posBegin = posEnd + 1;
                    }
                    else break;
                }
                else break;
            }
        }
        else std::cerr << "\nError: Could not open link file '" << linkFileName << "' for writing\n";
    }
    else std::cerr << "\nError: Could not open source file '" << sourceFileName << "' for reading\n";
}

Edit

If you need unique links, you may store the results in a std::unordered_set and write them out afterwards. Note that an unordered set does not keep the links in the order in which they were found.

#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
#include <algorithm>
#include <unordered_set>

const std::string sourceFileName{ "r:\\test.txt" };
const std::string linkFileName{ "r:\\links.txt" };

int main() {

    // Open source file for reading and check, if file could be opened
    if (std::ifstream ifs{ sourceFileName }; ifs) {

        // Open link file for writing and check, if file could be opened
        if (std::ofstream ofs{ linkFileName }; ofs) {

            // Read the complete 50MB file into a string
            std::string data(std::istreambuf_iterator<char>(ifs), {});

            const std::string searchString{ "data-permalink=\"" };
            const std::string permalinkEndString{ "\"" };

            // Here we will store unique results
            std::unordered_set<std::string> result{};

            // Do a linear search
            for (size_t posBegin{}; posBegin < data.length(); ) {

                // Search for the begin of the permalink
                if (posBegin = data.find(searchString, posBegin); posBegin != std::string::npos) {

                    const size_t posStartForEndSearch = posBegin + searchString.length();

                    // Search for the end of the permalink
                    if (size_t posEnd = data.find(permalinkEndString, posStartForEndSearch); posEnd != std::string::npos) {

                        // Output result
                        const size_t lengthPermalink{ posEnd - posStartForEndSearch };
                        const std::string output{ data.substr(posStartForEndSearch, lengthPermalink) };
                        result.insert(output);

                        posBegin = posEnd + 1;
                    }
                    else break;
                }
                else break;
            }
            for (const std::string& link : result)
               ofs << link << '\n';

        }
        else std::cerr << "\nError: Could not open link file '" << linkFileName << "' for writing\n";
    }
    else std::cerr << "\nError: Could not open source file '" << sourceFileName << "' for reading\n";
}
A M
  • Thank you very much! It worked really well. I applied it to my own file and it listed some output strings like 6 times in a row. That is because they actually appear that often in the file, right? At least I don't see how that could happen in the program you proposed. Thank you very much !!!! – Sunlaser Apr 02 '22 at 21:21
  • It apparently is in the file. How could I make it write each string only once, and skip it when such a string has already been read? It shows me something like this: 2yo8vj 2yo8vj 2yodvm 2yodvm 2yodvm. – Sunlaser Apr 02 '22 at 21:38
  • You may store the results in a container, then eliminate the doubles with std::unique or store the links in a container that does not allow doubles, like a `std::unordered_set` and then output later. See my edit in the answer – A M Apr 04 '22 at 07:44
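To illustrate the two options from the last comment, here is a minimal sketch, assuming the links have already been collected into a std::vector. The function names `dropAdjacentDuplicates` and `dropAllDuplicates` are hypothetical. Keep in mind that std::unique only removes duplicates that sit directly next to each other, so it removes all duplicates only if equal links are adjacent (for example after sorting).

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// std::unique collapses runs of equal neighbours; a link that shows up
// again later in the vector is kept.
std::vector<std::string> dropAdjacentDuplicates(std::vector<std::string> links) {
    links.erase(std::unique(links.begin(), links.end()), links.end());
    return links;
}

// Keeps only the first occurrence of every link and preserves the order
// in which the links were found, using a set to remember what was seen.
std::vector<std::string> dropAllDuplicates(const std::vector<std::string>& links) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> result;
    for (const std::string& link : links)
        if (seen.insert(link).second)  // .second is true only for new elements
            result.push_back(link);
    return result;
}
```

For the output shown above (2yo8vj 2yo8vj 2yodvm 2yodvm 2yodvm) both variants give the same two links. They differ only when the same id appears again further down in the file.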