2

I am trying to parse a text file that contains numeric data. I have a lot of lines that look like

129.3 72.7 121.6 173.6 203.3 120.7 40.5 79.2 94.0 123.2 165.8 178.8 135.5 78.5 66.2

but the length of the lines varies. Each line is also preceded by a few spaces. I would like to use regular expressions to parse the line and place each number into an array that I can then manipulate later.

Using

std::getline(is, line);

std::tr1::regex rx("[0-9-\.]+");
std::tr1::cmatch res;
std::tr1::regex_search(line.c_str(), res, rx);

only matches the first number. If instead I use line anchors such as

"^[0-9-\.]+$" 
"^[0-9-\.]+"

I get no matches and

"[0-9-\.]+$"

just matches the last number. So I am probably doing something wrong. Thanks for any help.

jetak
  • res is an array, i.e. res[1], res[2], res[3]... should have your matches. Have you checked that or are you just getting res? –  Feb 17 '12 at 23:08
  • regexp are really not the best solution here, just using operator>> into floats is much easier to use, and much better suited. – PlasmaHH Feb 17 '12 at 23:19
  • I agree with PlasmaHH, but who knows for what ever reason, someone wants to play with regex... –  Feb 17 '12 at 23:22
  • I checked the size of res and it only contains one element. I would use operator>> but the number of elements per line changes; some lines have 15, others have fewer. – jetak Feb 17 '12 at 23:35
  • sure, but this is because your regex is incorrect. `>>` is certainly THE C++ way to do it, but regex gives additional flexibility. It is good to know. –  Feb 18 '12 at 18:56

4 Answers

2

Um, pseudocode

 for str in strtok(input string)
     vector[index] = convert str to float

Here's an example using lots of stream magic: Split a string in C++?

Here's an example using a vector: Splitting a string by whitespace in c++

But plain old strtok is probably easiest: http://www.cplusplus.com/reference/clibrary/cstring/strtok/

in which case you'll get something like

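// needs <cstring>, <cstdlib>, <vector>
std::vector<double> flts;
// str must be a writable char buffer; strtok modifies it in place.
for (char *cp = std::strtok(str, " "); cp != NULL; cp = std::strtok(NULL, " ")) {
    flts.push_back(std::atof(cp));
}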

Now, that's very C like because I'm out of practice for C++, but the key point here is that by trying to use regex, you make it overcomplicated.
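For comparison, the stream-based route mentioned in the comments on the question (operator>> into floating-point values) might look like the sketch below. The helper name parse_line is made up for illustration, and the sketch assumes each line has already been read with std::getline.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads every whitespace-separated number on one line.
std::vector<double> parse_line(const std::string& line) {
    std::istringstream in(line);
    std::vector<double> values;
    double x;
    // operator>> skips leading whitespace and the loop ends at the first
    // failed read, so lines with 15 numbers or fewer are handled alike.
    while (in >> x) {
        values.push_back(x);
    }
    return values;
}
```

Calling parse_line(line) on each line read by std::getline(is, line) returns a vector whose size is the number of values on that line, so a varying line length needs no special handling.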

Charlie Martin
  • Indeed... when you want to have to choose between standards-compliant and thread-safety, strtok wins! – jkerian Feb 17 '12 at 23:01
  • Thanks for the help. I'll give it a try. And yes, regexp is a bit overkill for such a simple task but it was the first thing that came to mind. I'm a bit out of practice with C++. – jetak Feb 17 '12 at 23:37
  • Gee, @g24l, do you suppose that's why I said "that's very C like"? But if I were *really* doing it in C fashion I'd have used `sscanf`. – Charlie Martin Feb 18 '12 at 06:13
  • I've seen it, but this is the accepted answer in a question about C++ and it might be a bit misleading. –  Feb 24 '12 at 17:17
0

You need to include the space between the numbers in your match to match the whole line.

BTW, take a look at C++ tokenize a string using a regular expression to see a rather closely related answer.

You really shouldn't be using arrays here, use the standard containers for safety, convenience and sanity of anyone who has to look at this code later.
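If the regex route is kept, one way to combine it with a standard container is to iterate over all matches instead of calling regex_search once. This is only a sketch: it assumes C++11 std::regex rather than the std::tr1 version used in the question (TR1 also provides a regex_iterator), the name extract_numbers is hypothetical, and the pattern assumes every number starts with a digit, as in the sample line.

```cpp
#include <cstdlib>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every number-like token found on a line.
std::vector<double> extract_numbers(const std::string& line) {
    // Optional sign, digits, optional fractional part. [.] is used so that
    // no backslash escaping is needed inside the C++ string literal.
    static const std::regex number("-?[0-9]+(?:[.][0-9]*)?");
    std::vector<double> values;
    // A default-constructed sregex_iterator marks the end of the matches.
    for (std::sregex_iterator it(line.begin(), line.end(), number), end;
         it != end; ++it) {
        values.push_back(std::atof(it->str().c_str()));
    }
    return values;
}
```

Each iterator step resumes the search where the previous match ended, which is what a single regex_search call does not do.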

jkerian
0

It looks like the regex has a small issue:

"[0-9-\.]+"

should be more like:

 "[0-9\.]+"
macduff
0

your regex might be incorrect, you should try:

[0-9\.]+

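Also keep in mind that std::tr1::cmatch holds the results of a single match: res[0] is the whole match and res[1], res[2], ... are the parenthesized sub-expressions, so after one regex_search call res[2] does not contain 72.7. To reach the later numbers, search again on the remainder of the line (res.suffix()) or use std::tr1::regex_iterator.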

Using egrep you can experiment a bit:

egrep "[0-9-\.]+" /tmp/x
egrep: Invalid range end

but

egrep "^[0-9\.]+" /tmp/x

matches only

129.3 

and

egrep "[0-9\.]+" /tmp/x

matches all

129.3 72.7 121.6 173.6 203.3 120.7 40.5 79.2 94.0 123.2 165.8 178.8 135.5 78.5 66.2

You don't need ^ in front because it matches the empty string at the start of the line, i.e. you get only the first sequence of numbers (and nothing at all if the line starts with spaces, as in the question).

You don't need $ because it matches only the empty string at the end of the line, thus you get only the last sequence of numbers.

You need + so that each match covers a whole run of characters of type [0-9\.], i.e. a complete number rather than a single digit.
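The effect of the anchors can also be checked from C++. The sketch below assumes C++11 std::regex with its default ECMAScript grammar (not egrep's POSIX syntax), and first_match is a hypothetical helper written only for this experiment.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first match of pattern in text,
// or an empty string if the pattern does not match anywhere.
std::string first_match(const std::string& text, const std::string& pattern) {
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern))) {
        return m.str();
    }
    return std::string();
}
```

With the text "129.3 72.7 66.2", the pattern "^[0-9.]+" yields "129.3" and "[0-9.]+$" yields "66.2". With two leading spaces in front of the same text, "^[0-9.]+" yields nothing, which is consistent with the question's report of no matches for the anchored pattern.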

You can also get a short guide to regex matching on any Unix system by issuing

man -S 7 regex

p.s. /tmp/x is a file with the line that is provided in the question.