
I'm reading a batch of data from a file through `mmap`, and that part works as I expect:

    size_t filesize = getFilesize(argv[1]);
    int fd = open(argv[1], O_RDONLY, 0);
    assert(fd != -1);
    char* mmappedData = static_cast<char*>(mmap(NULL, filesize, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0));
    assert(mmappedData != MAP_FAILED);

    char *strmmap = strdup(mmappedData);
    char *strData = strtok(strmmap, "#");
    strData = strtok(strData, ";");
    string rid;
    unsigned long int timestamp;
    string tid;
    string TIMESTAMP;
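
A side note on the snippet above: a mapped file is not guaranteed to end in a NUL byte, so `strdup(mmappedData)` can read past the end of the mapping when the file size happens to be an exact multiple of the page size. A minimal sketch of a length-aware alternative, assuming C++17 is available; `viewOfMapping` and `copyOfMapping` are hypothetical helper names, not part of the original code:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: wrap the mapped bytes in a non-owning view.
// Nothing is copied and no terminating NUL is required.
std::string_view viewOfMapping(const char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    return std::string_view(data, size);
}

// Hypothetical helper: make an owning copy, for data that must
// outlive the mapping or needs to be modified.
std::string copyOfMapping(const char* data, std::size_t size) {
    return std::string(viewOfMapping(data, size));
}
```

With the variables from the question this would be called as `viewOfMapping(mmappedData, filesize)`. The view stays valid only as long as the mapping does.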

Here is some of the test data from the mapped file:

    1497648366867,{75: 5, 76: 2, 77: 4, 78: 1, 79: 0, 80: 3}
    79;ns]D;1497648366929
    77;_1[A;1497648366940
    78;~E=);1497648366940
    78;~E=);1497648366943
    77;_1[A;1497648366947
    80;QXD=;1497648366991
    78;E}Hy;1497648366991
    78;E}Hy;1497648366997
    80;QXD=;1497648367004

I'm struggling with the string manipulation:

    do{
        timestamp = strtoul(strData, NULL, 0);
        strData = strtok(NULL, ";");
        tid = strData;
        cout <<"tid:"<<tid << '\t' << "timestamp:"<<timestamp <<endl;
    } while((strData = strtok(NULL, ";")) != NULL);
    cout <<"================================="<<endl;

The `timestamp` variable holds the correct value for the first line of the test data, but not for any of the remaining lines.
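
One possible contributing factor, offered as an assumption because the failing output isn't shown: `strtok` with `";"` as its only delimiter never splits on newlines, so any token that crosses a line boundary contains the end of one line and the start of the next. A small sketch that makes this visible; `tokensOf` is a hypothetical helper used only for illustration:

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Collect every token strtok yields for the given delimiter set.
// The text is taken by value because strtok overwrites delimiters with NULs.
std::vector<std::string> tokensOf(std::string text, const char* delims) {
    assert(delims != nullptr);
    std::vector<std::string> tokens;
    for (char* tok = std::strtok(text.data(), delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        tokens.emplace_back(tok);
    }
    return tokens;
}
```

For a shortened input such as `"1,{a}\n79;ab;2\n77;cd;3"`, splitting on `";"` yields a first token that contains both the header line and the `79` from the next line, while splitting on `";\n"` yields one token per field. The second form still cannot tell a two-field line from a three-field one, which is why splitting into lines first tends to be easier to reason about.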

What I'm trying to achieve is something like the following. Assume `lines` is an array in which each line has already been split into fields: on semicolons if the line has three fields, or on the first comma if it has two. Here is a prototype in Python:

    for l in range(len(lines)):
        if len(lines[l]) == 2:
            # header line: timestamp,{tid: count, ...}
            timestamp = lines[l][0]
            tids = lines[l][1]
        else:
            # record line: tid;rid;timestamp
            tid = lines[l][0]
            rid = lines[l][1]
            timestamp = lines[l][2]
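
The prototype above could be sketched in C++17 roughly as follows, without `strtok` or `strdup` and without modifying the input buffer. This is one way to do it under stated assumptions, not the definitive approach: `Record`, `Parsed`, `toNumber`, and `parseLines` are hypothetical names, lines are assumed to end in `\n`, and the `{...}` part of the header is kept as raw text.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical record type for three-field lines: tid;rid;timestamp
struct Record {
    std::string tid;
    std::string rid;
    unsigned long long timestamp;
};

// Hypothetical result type: the two-field header plus all records.
struct Parsed {
    unsigned long long headerTimestamp = 0;
    std::string tids;          // raw "{75: 5, ...}" text, left unparsed
    std::vector<Record> records;
    std::size_t skipped = 0;   // lines that matched neither shape
};

// Parse a whole field as a decimal number.
// Returns false on empty input, stray characters, or overflow.
bool toNumber(std::string_view s, unsigned long long& out) {
    if (s.empty()) return false;
    const char* end = s.data() + s.size();
    auto [ptr, ec] = std::from_chars(s.data(), end, out);
    return ec == std::errc() && ptr == end;
}

Parsed parseLines(std::string_view text) {
    Parsed result;
    while (!text.empty()) {
        // Peel off one line; the underlying buffer is never written to.
        const std::size_t eol = text.find('\n');
        const std::string_view line = text.substr(0, eol);
        text.remove_prefix(eol == std::string_view::npos ? text.size() : eol + 1);
        if (line.empty()) continue;

        const std::size_t first = line.find(';');
        const std::size_t last = line.rfind(';');
        const std::size_t comma = line.find(',');
        unsigned long long ts = 0;

        if (first != std::string_view::npos && first != last &&
            toNumber(line.substr(last + 1), ts)) {
            // Three fields: everything between the first and last ';' is the rid.
            assert(first < last);
            result.records.push_back(Record{
                std::string(line.substr(0, first)),
                std::string(line.substr(first + 1, last - first - 1)),
                ts});
        } else if (first == std::string_view::npos &&
                   comma != std::string_view::npos &&
                   toNumber(line.substr(0, comma), ts)) {
            // Two fields: timestamp before the first ',', the rest is tids.
            result.headerTimestamp = ts;
            result.tids = std::string(line.substr(comma + 1));
        } else {
            ++result.skipped;  // malformed line: count it rather than guess
        }
    }
    return result;
}
```

Taking the rid as everything between the first and the last semicolon is a deliberate choice here, since the sample rids contain punctuation; whether a rid can itself contain `;` is not stated in the question. Calling `parseLines(std::string_view(mmappedData, filesize))` would parse the mapping in place.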
tandem
  • which line is C++? – Andriy Tylychko Jun 16 '17 at 21:41
  • Any C++ code that calls functions like strdup and strtok is wrong. Arguably, any C code that calls them is wrong too. –  Jun 16 '17 at 21:42
  • @Gruffalo The fourth one. –  Jun 16 '17 at 21:43
  • @NeilButterworth: what would you use instead of strdup and strtok? – tandem Jun 16 '17 at 21:45
  • @NeilButterworth: ah, missed it :) and what I clumsily tried to say was: use C++, `std::string` in this particular case – Andriy Tylychko Jun 16 '17 at 21:46
  • @tandem In C++, I'd use std::string. In C, I'd write my own parser as the C standard library doesn't really have anything useable in this area. –  Jun 16 '17 at 21:46
  • @NeilButterworth, a pointer to which member functions of `string` to start with would be helpful – tandem Jun 16 '17 at 21:58
  • @tandem Your basic approach is wrong - you don't want to modify the string (which is what strtok does, and why it is toxic), you want to extract substrings into a vector or similar container. –  Jun 16 '17 at 22:00
  • @NeilButterworth, but wouldn't that imply converting the char* to a string to make it a substr? – tandem Jun 16 '17 at 22:10
  • @tandem Yes, maybe, in the same way that you are converting it to a malloc'd C string. Or you could just create a vector of strings from the original C string, but _not_ using strtok and strdup. –  Jun 16 '17 at 22:12
  • @NeilButterworth, got that. So would you also use substr or getline to split strings? (P.S.: I'm not looking for a hard-coded solution, I'm only looking for help on the best course of action.) – tandem Jun 16 '17 at 22:20
  • I don't know enough about how your memory mapped string is used to answer this, but I would probably write my own simple parser for the MM string that produced std::strings. Apart from my issues with strtok modifying the thing it processes, it also has no error detection abilities, which you really want when parsing data. –  Jun 16 '17 at 22:26
  • @NeilButterworth: Let me give that a shot for a couple more hours and see how far I get. – tandem Jun 16 '17 at 22:31
  • @NeilButterworth: https://stackoverflow.com/a/236803/1059860 this seems to help a lot as a starting point – tandem Jun 16 '17 at 22:50
  • I got the things working. I could show the example here, maybe you can help me improve it? – tandem Jun 17 '17 at 00:56

0 Answers