
I have an input file that contains numerical input parameters mixed with otherwise irrelevant text.

E.g.


//file orig_data.txt
Number of customers
25
Customer_id Customer_SSN
1           1234 
2           3456
....
25          0123 

All input files have this same format, although the number of customers may differ. The example above has 25 customers; another input file might have 35 customers and hence 35 rows of data (id and SSN).

One option would be to strip the text out manually and create a modified input file that contains only the numerical data.


//file modified_data.txt
25
1           1234 
2           3456
....
25          0123 

With this type of input file (modified_data.txt), I can read the data in with the following code.

void readdata() {
        FILE* fp = fopen("modified_data.txt","r");
        fscanf(fp, "%d", &no_cust);
        int* ids  = new int[no_cust+1];
        int* ssns = new int [no_cust+1];
        for(int i = 1; i <= no_cust; i++)
                   fscanf(fp, "%d %d", &ids[i], &ssns[i]);
        //other stuff
}

Is there a way to work with orig_data.txt directly, so that the function reading it skips over the first and third lines, which contain the unneeded text?

MikeCAT
Tryer
If it's always the same lines then just read those with `fgets` and throw away the data. – kaylum May 06 '21 at 12:06
Be careful when mixing `fscanf()` and `fgets()`: [c - fgets doesn't work after scanf - Stack Overflow](https://stackoverflow.com/questions/5918079/fgets-doesnt-work-after-scanf) – MikeCAT May 06 '21 at 12:08
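
The two comments above can be combined into one sketch: read every line with `fgets` and parse the numeric ones with `sscanf`, so that `fscanf` and `fgets` are never mixed on the same stream. The function name `read_customers_by_line`, the `Row` struct and the 256-byte buffer are assumptions made for this illustration; lines longer than the buffer are not handled.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical record type for this sketch.
struct Row {
    int id;
    int ssn;
};

// Expects the orig_data.txt layout on an already opened stream:
// line 1 = text, line 2 = count, line 3 = text, then one row per line.
// Returns false if the layout is not as expected.
bool read_customers_by_line(std::FILE* fp, std::vector<Row>& rows) {
    char line[256];  // assumed to be long enough for every line
    int count = 0;

    if (!std::fgets(line, sizeof line, fp)) return false;  // discard "Number of customers"
    if (!std::fgets(line, sizeof line, fp) ||
        std::sscanf(line, "%d", &count) != 1 || count < 0)
        return false;                                      // the count line
    if (!std::fgets(line, sizeof line, fp)) return false;  // discard the column header

    rows.clear();
    for (int i = 0; i < count; ++i) {
        Row r;
        if (!std::fgets(line, sizeof line, fp) ||
            std::sscanf(line, "%d %d", &r.id, &r.ssn) != 2)
            return false;                                  // missing or malformed row
        rows.push_back(r);
    }
    return true;
}
```

Note that `%d` stores `0123` as 123; if leading zeros matter, the SSN would have to be read as text instead.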

2 Answers


You should just use the return value of fscanf, because it contains the number of items successfully converted.

So you could apply minimal changes to your proposed code:

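void readdata() {
        FILE* fp = fopen("orig_data.txt","r");
        if (fp == NULL) return; // could not open the file
        int rc;
        // a failed %d conversion consumes nothing, so drop the rest of that line and retry
        while ((rc = fscanf(fp, "%d", &no_cust)) == 0)
            fscanf(fp, "%*[^\n]");
        if (rc != 1) { fclose(fp); return; } // end of file before any number
        int* ids  = new int[no_cust+1];
        int* ssns = new int [no_cust+1];
        for(int i = 1; i <= no_cust; i++) {
            while ((rc = fscanf(fp, "%d %d", &ids[i], &ssns[i])) == 0)
                fscanf(fp, "%*[^\n]"); // skips the header line
            if (rc != 2) break; // end of file or incomplete row
        }
        //other stuff
}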

But this is really minimal: it makes no attempt at error recovery and does not even print a useful message when the input data is malformed.
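
As a rough illustration of the error handling left out here, the sketch below keeps the same idea (check the return value of `fscanf`) but tells the end of the file apart from a text line and reports which row was missing. The name `read_customers`, the `FILE*` parameter, the use of `std::vector` and the message wording are assumptions made for this example, not part of the original code.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical helper: fills ids/ssns from an already opened file.
// Returns true on success; on failure prints a message to stderr.
bool read_customers(std::FILE* fp, std::vector<int>& ids, std::vector<int>& ssns) {
    int no_cust = 0, rc;

    // Skip text lines until a number is converted or the input ends.
    // A failed %d conversion consumes nothing, so the line is dropped explicitly.
    while ((rc = std::fscanf(fp, "%d", &no_cust)) == 0)
        std::fscanf(fp, "%*[^\n]");
    if (rc != 1 || no_cust < 0) {
        std::fprintf(stderr, "no valid customer count found\n");
        return false;
    }

    ids.clear();
    ssns.clear();
    for (int i = 1; i <= no_cust; i++) {
        int id, ssn;
        while ((rc = std::fscanf(fp, "%d %d", &id, &ssn)) == 0)
            std::fscanf(fp, "%*[^\n]");  // header or other text: drop the line
        if (rc != 2) {                   // end of file, or an id without an SSN
            std::fprintf(stderr, "row %d of %d is missing or incomplete\n", i, no_cust);
            return false;
        }
        ids.push_back(id);
        ssns.push_back(ssn);
    }
    return true;
}
```

The caller still opens and closes the file, which keeps this function easy to exercise with a temporary file.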

Serge Ballesta

You tagged with C++, so why drop to the level of C?

Standard Library

IOStreams has the goods, but the interface is cumbersome and the implementations are not very efficient:

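#include <cstddef>
#include <cstdint>
#include <istream>
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

struct Customer {
    int            id;
    std::uintmax_t SSN;
};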

auto parse_customers(std::istream& data) {
    std::vector<Customer> customers;
    std::string line;

    while (getline(data, line))
        if (line == "Number of customers")
            break;

    auto skipline = [](std::istream& is, int n = 1) -> auto& {
        // o horrors, iostreams interface is not enjoyable
        while (is && n--)
            is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
        return is;
    };

    std::size_t n;
    if (skipline((data >> n), 2); n)
        while (data.good() && n--) {
            Customer c;
            if (skipline(data >> c.id >> c.SSN))
                customers.push_back(c);
        }

    if ((n != -1ull) || data.fail())
        throw std::invalid_argument("parse_customers");
    return customers;
}

See it Live On Compiler Explorer
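
For comparison, a different technique with less stream-state juggling is to read whole lines with `std::getline` and parse each one through a `std::istringstream`. This is only a sketch under the assumption that the count is the first line starting with an integer and that every later "id SSN" line is a record; the name `parse_customers_by_line` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

struct Customer {
    int            id;
    std::uintmax_t SSN;
};

// Line-oriented variant: the first line that starts with an integer supplies
// the count; every following line that parses as "id SSN" becomes a record.
std::vector<Customer> parse_customers_by_line(std::istream& data) {
    std::vector<Customer> customers;
    std::size_t expected = 0;
    bool have_count = false;
    std::string line;

    while (std::getline(data, line)) {
        std::istringstream fields(line);
        Customer c;
        if (!have_count) {
            // Text lines such as "Number of customers" fail this extraction.
            have_count = static_cast<bool>(fields >> expected);
        } else if (fields >> c.id >> c.SSN) {
            customers.push_back(c);
        }  // anything else (the column header) is skipped
    }

    // Reject input whose row count does not match the declared count.
    if (!have_count || customers.size() != expected)
        throw std::invalid_argument("parse_customers_by_line");
    return customers;
}
```

It can be fed from a `std::ifstream` opened on orig_data.txt or, for a quick check, from a `std::istringstream` holding the sample text.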

Boost Spirit

Let's try with Boost Spirit:

auto marker  = no_case["number of customers"];
auto header  = *(char_ - eol); //one whole line
auto id      = uint_;
auto SSN     = uint_;
auto record  = rule<void, Customer> {} = id >> SSN;
auto grammar
    = seek[marker] >> eol
    >> omit[uint_ >> eol] // ignored
    >> omit[header] >> eol // ignored
    >> record % eol
    ;

Assuming any suitable Customer type:

#include <boost/fusion/adapted/std_pair.hpp>
using Customer = std::pair<int /*id*/, uintmax_t /*ssn*/>;

(You can use structs as well; the point is the "automatic" propagation of members in a strongly typed fashion.)

See it Live On Compiler Explorer printing

Parsed 25 customers: {(1, 1234), (2, 3456), (3, 3457), (4, 3458), (5, 3459), (6, 3460), (7, 3461), (8, 3462), (9, 3463), (10, 3464), (11, 3465), (12, 3466), (13, 3467), (14, 3468), (15, 3469), (16, 3470), (17, 3471), (18, 3472), (19, 3473), (20, 3474), (21, 3475), (22, 3476), (23, 3477), (24, 3478), (25, 123)}

Actually, you could simplify (in this specific case) to just

auto grammar = seek[(uint_ >> uint_) % eol];

because all the rest is supposedly ignored and wouldn't match.

sehe