Read text file step-by-step

Question

I have a file which has text like this:

#1#14#ADEADE#CAH0F#0#0.....

I need to write code that finds the text following each # symbol, stores it in a variable, and then writes it to a file WITHOUT the # symbol but with a space before it. So from the line above I would get:

1 14 ADEADE CAH0F 0 0......

I first tried to do it in Python, but the files are really big and it takes a huge amount of time to process them, so I decided to write this part in C++. However, I know nothing about C++ regex, and I'm looking for help. Could you please recommend an easy-to-use regex library (I don't know C++ very well) or a well-documented one? It would be even better if you could provide a small example (I know how to write to a file using fstream, but I need help with reading the file as described above).

Can I ask why you want to use regex? There's lots of other ways to parse strings and regex seems pretty intense for something as simple as this... — Zann Anderson, Oct 05 '11 at 17:43
Not a fan of [`string::replace`](http://www.cplusplus.com/reference/string/string/replace/)? — Brad Christie, Oct 05 '11 at 17:45
`that follows # symbol, store it to variable and then writes it to file WITHOUT # symbol, but with a space before` - any specific reason you want that temporary variable, other than just over-engineering and soliciting buffer overruns and DoS attacks? — sehe, Oct 05 '11 at 17:50
I've always thought that there is no more powerful method for editing text than regex. Now I know I was wrong. :-) — ghostmansd, Oct 05 '11 at 18:12

score 4 · Accepted Answer · answered Oct 05 '11 at 17:46

This looks like a job for std::locale and his trusty sidekick imbue:

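#include <iostream>
#include <locale>
#include <string>

// A ctype facet whose only whitespace character is '#'.
struct hash_is_space : std::ctype<char> {
  hash_is_space() : std::ctype<char>(get_table()) {}
  static mask const* get_table()
  {
    // Static storage is zero-initialized, so no other character
    // (not even ' ' or '\n') is classified as a space.
    static mask rc[table_size];
    rc['#'] = std::ctype_base::space;
    return &rc[0];
  }
};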

int main() {
  using std::string;
  using std::cin;
  using std::locale;

  cin.imbue(locale(cin.getloc(), new hash_is_space));

  string word;
  while(cin >> word) {
    std::cout << word << " ";
  }
  std::cout << "\n";
}

Robᵩ


very good thought; It just clicked with me, but then you already posted it – sehe Oct 05 '11 at 17:51
Just wondering, why not use a RegEx library instead? This is kind of cool, though. – Ehtesh Choudhury Oct 05 '11 at 17:51
@Shurane So I don't have to read the entire input file into a single `string` first. – Robᵩ Oct 05 '11 at 17:53

score 1 · Answer 2 · answered Oct 05 '11 at 17:47

IMO, C++ is not the best choice for your task. But if you have to do it in C++ I would suggest you have a look at Boost.Regex, part of the Boost library.
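
For readers who do want a regex, here is a minimal sketch. It assumes a C++11 compiler and uses the standard library's std::regex instead of Boost.Regex (Boost provides a similarly named boost::regex_replace). The function name regex_hashes_to_spaces is made up for illustration.

```cpp
#include <istream>
#include <ostream>
#include <regex>
#include <string>

// Replaces every '#' with a space, one line at a time, so only the
// current line is held in memory.
void regex_hashes_to_spaces(std::istream& in, std::ostream& out) {
    static const std::regex hash("#");
    std::string line;
    while (std::getline(in, line)) {
        // regex_replace substitutes all matches in the line by default.
        out << std::regex_replace(line, hash, " ") << '\n';
    }
}
```

This always ends the output with a newline, and it only bounds memory use if the input actually contains line breaks; a file that is one enormous line would still be read in a single piece.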

Chris


*"...now you have two problems..."* – dmckee --- ex-moderator kitten Oct 05 '11 at 18:00
Thanks, I'll look at it. I've heard that Boost is one of the best libraries for C++, and that some useful functions from it will be in upcoming versions of C++. – ghostmansd Oct 05 '11 at 18:14

Ehtesh Choudhury · Answer 3 · 2011-10-05T18:17:45.363

1

If you are on Unix, a simple sed 's/#/ /g' <infile >outfile would suffice (the g flag replaces every # on a line, not just the first).

Sed stands for 'stream editor' (and supports regexes! whoo!), so it would be well-suited for the performance that you are looking for.

edited Oct 05 '11 at 18:17

answered Oct 05 '11 at 17:58

Ehtesh Choudhury


I use Linux, but I think users of Windows won't agree if I will use sed in my application. :-) – ghostmansd Oct 05 '11 at 18:15
But `sed` is more versatile! Curious, what does `tr` do that `sed` cannot easily? – Ehtesh Choudhury Oct 05 '11 at 18:16
@ghostmansd you can embed it and no one will be able to tell the difference :P – Ehtesh Choudhury Oct 05 '11 at 18:20
Sed is fine. I like and use sed, but invoking the full power of regexp for a straight ahead one-to-one substitution is silly. – dmckee --- ex-moderator kitten Oct 06 '11 at 03:02

score 0 · Answer 4 · edited May 23 '17 at 12:27

Alright, I'm just going to make this an answer instead of a comment. Don't use regex. It's almost certainly overkill for this task. I'm a little rusty with C++, so I'll not post any ugly code, but essentially what you could do is parse the file one character at a time, putting anything that wasn't a # into a buffer, then writing it out to the output file along with a space when you do hit a #. In C# at least two really easy methods for solving this come to mind:

using System.IO;
using System.Text;

// FileMode belongs to the FileStream constructor, not to StreamReader/StreamWriter.
StreamReader fileReader = new StreamReader(new FileStream("myFile.txt",
                              FileMode.Open));
string fileContents = fileReader.ReadToEnd();
string outFileContents = fileContents.Replace("#", " ");
StreamWriter outFileWriter = new StreamWriter(new FileStream("outFile.txt",
                                 FileMode.Create), Encoding.UTF8);
outFileWriter.Write(outFileContents);
outFileWriter.Flush();

Alternatively, you could replace

string outFileContents = fileContents.Replace("#", " ");

With

StringBuilder outFileContents = new StringBuilder();
string[] parts = fileContents.Split('#');
foreach (string part in parts)
{
    outFileContents.Append(part);
    outFileContents.Append(" ");
}

I'm not saying you should do it either of these ways or my suggested method for C++, nor that any of these methods are ideal - I'm just pointing out here that there are many many ways to parse strings. Regex is awesome and powerful and may even save the day in extreme circumstances, but it's not the only way to parse text, and may even destroy the world if used for the wrong thing. Really.

If you insist on using regex (or are forced to, as in for a homework assignment), then I suggest you listen to Chris and use Boost.Regex. Alternatively, I understand Boost has a good string library as well if you'd like to try something else. Just look out for Cthulhu if you do use regex.

First, ghostmansd doesn't want to read the entire file into a string. Also, regex is pretty simple. You can see it as a transformation on text. It's only considered terrible if you're using it for parsing languages, like HTML. You can't express those using regex so any attempt to do so will fail. — Ehtesh Choudhury, Oct 05 '11 at 18:59
I wasn't necessarily saying that he should read the entire file into a string - just that there were other options. As far as simplicity with regex, I've no argument there - I was merely pointing out that for a simple case like this there are other alternatives, and attempting to warn against seeing every problem as a nail with regex as the hammer. I appreciate your clarification though, @Shurane. — Zann Anderson, Oct 05 '11 at 19:40

score 0 · Answer 5 · answered Oct 05 '11 at 18:38

You've left out one crucial point: if you have two (or more) consecutive #s in the input, should they turn into one space, or into the same number of spaces as there are #s?

If you want to turn an entire run of #s into a single space, then @Rob's solution should work quite nicely.

If you want each # turned into a space, then it's probably easiest to just write C-style code:

#include <stdio.h>

int main() { 
    int ch;
    while (EOF!=(ch=getchar()))
        if (ch == '#')
            putchar(' ');
        else
            putchar(ch);
    return 0;
}

eyquem · Answer 6 · 2011-10-05T19:45:47.873

So, you want to replace each single '#' character with a single ' ' character, right?

Then it's easy, since you can overwrite any portion of a file with a string of exactly the same length without disturbing the layout of the rest of the file.
Repeating such a replacement lets you transform the file chunk by chunk, so you avoid reading the whole file into memory, which is problematic when the file is very big.

Here's the code in Python 2.7.

Maybe replacing chunk by chunk won't be sufficient to make it faster, and you'll have a hard time writing the same thing in C++. But in general, when I have proposed code like this, it has improved the execution time satisfactorily.

def treat_file(file_path, chunk_size):
    from os import fsync

    from os.path import getsize
    file_size = getsize(file_path)

    with open(file_path,'rb+') as g:
        fd = g.fileno() # file descriptor, it's an integer

        while True:
            x = g.read(chunk_size)
            g.seek(- len(x),1)
            g.write(x.replace('#',' '))
            g.flush()
            fsync(fd)
            if g.tell() >= file_size:
                break

Comments:

open(file_path,'rb+')

it's essential to open the file in binary mode 'b' in order to control precisely the positions and movements of the file's pointer;
mode '+' makes it possible to both read AND write in the file

fd = g.fileno()

file descriptor, it's an integer

x = g.read(chunk_size)

reads a chunk of size chunk_size. It would be neat to give it the size of the read buffer, but I don't know how to find that buffer's size. Hence a good idea is to give it a power-of-2 value.

g.seek(- len(x),1)

the file's pointer is moved back to the position where the read of this chunk started. It must be len(x), not chunk_size, because the last chunk read is in general shorter than chunk_size

g.write(x.replace('#',' '))

overwrites the same span of the file with the modified chunk

g.flush()
fsync(fd)

these two instructions force the write; otherwise the modified chunk could remain in the write buffer and be written at an uncontrolled moment

if g.tell() >= file_size:  break

after the last portion of the file has been read and rewritten, whatever its length (less than or equal to chunk_size), the file's pointer is at the maximum position of the file, that is to say file_size, and the program must stop
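
The same in-place, chunk-by-chunk idea can be sketched in C++ with a std::fstream opened for both reading and writing in binary mode. This is an illustrative sketch rather than a tuned implementation: the function name replace_hashes_in_place and the default chunk size are assumptions, and it leaves flushing to the stream instead of forcing a sync after every chunk.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Overwrites every '#' in the file with ' ' without changing the file size.
// Returns false if the file cannot be opened or a write fails.
bool replace_hashes_in_place(const std::string& path,
                             std::size_t chunk_size = 1 << 16) {
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!f) {
        return false;
    }
    std::vector<char> buf(chunk_size);
    std::streampos pos = 0;  // start of the chunk being processed
    while (true) {
        f.seekg(pos);
        f.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = f.gcount();
        if (got <= 0) {
            break;  // nothing left to read
        }
        std::replace(buf.begin(), buf.begin() + got, '#', ' ');
        f.clear();     // a short final read sets eofbit and failbit
        f.seekp(pos);  // go back to where this chunk started
        f.write(buf.data(), got);
        if (!f) {
            return false;
        }
        pos += got;
    }
    return true;
}
```

The explicit seek before each read and each write matters: a stream opened for update must be repositioned (or flushed) when switching between reading and writing.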


In case you would like to replace several consecutive '###...' with only one space, the code is easily modified to meet this requirement, since writing a shortened chunk doesn't erase the still-unread characters further along in the file. It only needs two file pointers.
