I am trying to selectively plot some data from a file with columns, which isn't .csv or .tsv. A typical file is http://pastebin.com/pgSjezdh.

You can see that there is some header information at first, which of course should be skipped. Then there are some columns, of which I would like to plot one against another, for instance the first column as x and the fourth as y.

The idea is to use CERN's ROOT, which uses C/C++, so ideally this should be done in C/C++ so that ROOT can handle it.

Apart from that, the main problem is to somehow get the desired data in a two-column format, without strings.

What is the most efficient way to do that?

Brian Tompsett - 汤莱恩
Thanos
  • I would consider using flex instead. With flex you could easily write a script which splits the columns into 6 different files. Then you could read the 6 files with a C++ program. – HAL9000 Dec 05 '13 at 13:44
  • @HAL9000: Thank you very much for your comment! Flex? What exactly is that and how to use it? – Thanos Dec 05 '13 at 13:48
  • Flex is a scanner generator which combines the C language with the power of regular expressions. If you want to learn something new (if you think it could be useful for your career), here is a link to some examples: http://www.cs.princeton.edu/~appel/modern/c/software/flex/flex.html . If, instead, you need something right now, you could use fscanf to skip the first 24 lines, then use fscanf again to parse each value of each line in order to fill 6 float arrays. – HAL9000 Dec 05 '13 at 14:04
  • @HAL9000: Thank you very much for your help! Actually I need this to be done for several files. The way I imagined it, there would be a script that takes the file as input and gives the desired file as output. What is the best way to do it? – Thanos Dec 05 '13 at 14:42
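
The header-skipping step of the fscanf suggestion in the comments can be sketched in C as follows. This is only a minimal sketch: `skip_lines` is a hypothetical helper name, and the count of 24 header lines comes from the comment above, so it applies only to files laid out like the linked example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: consume n complete lines from f.
   Returns the number of lines actually skipped, which is less than n
   if the end of the file is reached first. */
int skip_lines(FILE *f, int n)
{
    int skipped = 0;
    int c;

    assert(f != NULL);

    while (skipped < n && (c = fgetc(f)) != EOF) {
        if (c == '\n') skipped++;   /* one more full line consumed */
    }
    return skipped;
}
```

After `skip_lines(f, 24)` the stream is positioned at the first data row, so the values can then be read with fscanf, or with fgets followed by sscanf.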

2 Answers

You can use this Flex code, named cern.l:

%x _DATA_
%x _END_

%option noyywrap
%{
  #define NUMBER_OF_COLUMNS 6
  #define FILENAME_LENGTH 64

  // variables
  float data;
  int dataNumber;
  FILE *files[NUMBER_OF_COLUMNS];
  char fileNames[NUMBER_OF_COLUMNS] [FILENAME_LENGTH];

  // functions
%} 
NUMBER [0-9]*(\.[0-9]+)?(E(\+|-)[0-9][0-9])?

%% 
"-----------  ---------- ---------- ----------  ----------  ----------" {
            fprintf(stderr,"\nBEGIN DATA");
            BEGIN(_DATA_);
        }
<_DATA_>"-----" {
                    BEGIN(_END_);
         }
<_DATA_>{NUMBER} {
         data = atof(yytext);
         fprintf(files[dataNumber],"%f\n",data);
         dataNumber = (dataNumber+1)%NUMBER_OF_COLUMNS;
        }
<_DATA_>"----------" { 
      BEGIN(_END_);
      }
%%
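int main(int argc, char* argv[])
{
  int i = 0;
  dataNumber = 0;
  for(i = 0; i < NUMBER_OF_COLUMNS; i++)
  {
    /* fopen() does not expand "~", so use a full path here */
    sprintf(fileNames[i],"/home/user/column%d.txt",i);
    files[i] = fopen(fileNames[i],"w");
    if(files[i] == NULL)
    {
      perror(fileNames[i]);
      return 1;
    }
  }
  yyin=fopen("/home/user/example.dat","r");
  if(yyin == NULL)
  {
    /* with yyin == NULL the scanner would read stdin and seem to hang */
    perror("/home/user/example.dat");
    return 1;
  }

  yylex();
  for(i = 0; i < NUMBER_OF_COLUMNS; i++)
  {
    fclose(files[i]);
  }
  return 0;
}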

You can compile it using make cern (without the .l extension). Under Linux this generates a binary executable named cern. Running ./cern generates 6 files (named column0.txt, column1.txt, etc.) containing your data.

If you prefer to integrate this code as a function in your C++ program, you can simply process it with flex cern.l. This generates a file named lex.yy.c which contains the C code. Then you can rename main in lex.yy.c to a function name, for example int parse(float *column0, float *column1, float *column2, float *column3, float *column4, float *column5), and call it from your program, which must then provide its own main. Obviously in this case you should first modify the original flex code so that it fills 6 float arrays with the data instead of writing files.
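
If modifying the generated scanner is more than you need, the column files written by cern can also be read back from a separate C program. The following is a minimal sketch, assuming the files hold one value per line as written by the fprintf call in the scanner; `read_column` is a hypothetical helper and not part of Flex.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: read up to max values (one per line) from the
   file at path into out. Returns the number of values read, or -1 if
   the file cannot be opened. */
long read_column(const char *path, float *out, size_t max)
{
    FILE *f;
    size_t n = 0;

    assert(path != NULL && out != NULL);

    f = fopen(path, "r");
    if (f == NULL) return -1;

    /* fscanf skips the newline between values on its own */
    while (n < max && fscanf(f, "%f", &out[n]) == 1) n++;

    fclose(f);
    return (long)n;
}
```

Calling it once for column0.txt and once for column3.txt would give the x and y arrays for the plot described in the question.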

Why use Flex? Because it generates an optimized scanner. If you need to parse a large amount of data in very little time, it can be very useful. I also find it simpler than manually writing an FSM in C.

HAL9000
  • Thank you very much for your answer! In your code I see that you define the number of columns (which is probably going to be a constant) and the `FILENAME_LENGTH`... Why is it 64? In addition, can this be run on Windows? I hate to say that, but there are some utilities I use on Windows and it's rather annoying to boot into Windows, then reboot into Linux and then again into Windows... – Thanos Dec 05 '13 at 21:38
  • I used the Windows version of flex and works great, so don't worry, the usage is the same. Take a look here http://gnuwin32.sourceforge.net/packages/flex.htm – HAL9000 Dec 05 '13 at 21:46
  • 64 is a reasonable length for a file name, just this. I used a constant for the number of columns but you can use an integer variable and allocate the multidimensional array using malloc. – HAL9000 Dec 05 '13 at 21:47
  • Thank you very much for your help! I downloaded and installed `Flex`. On a cmd, I move to Desktop(where the `cern.l` is saved) and whichever `make cern.l` or `flex cern.l` I type, I get the message *`make`(or `flex`) cannot be recognised* – Thanos Dec 06 '13 at 08:43
  • Probably you can use flex only inside folder where you saved flex.exe. Try moving cern.l into that folder and with cmd work in that directory. – HAL9000 Dec 06 '13 at 08:50
  • I run `flex cern.l` and although I got no error message, there wasn't any `lex.yy.c` file generated. Could I have done something wrong? – Thanos Dec 06 '13 at 08:59
  • Try the command: make cern (without .l extension). Unfortunately I'm not using Windows and I can't try myself. – HAL9000 Dec 06 '13 at 09:02
  • Try this: http://stackoverflow.com/questions/7641675/cannot-install-flex-lexical-analyser-on-windows-unable-to-find-comprehensive – HAL9000 Dec 06 '13 at 09:12
  • The paths are set correctly; however, I can only run it if I manually go to the path where it's installed... I tried something different. I moved to the path and double-clicked on flex.exe. A command prompt showed up. I typed `make cern.l` and I got an error `"", line 5: name defined twice`... – Thanos Dec 06 '13 at 09:51
  • I decided to test it on `ubuntu 12.04`. I used `sudo apt-get install flex` to install. Then, I used `make cern.l` but I got the response that `make: Nothing to be done for 'cern.l'`. In the end I used `flex cern.l` and indeed a `lex.yy.c` file came up. I then replaced `int parse(int argc, char* argv[])` with `int parse(float *column0, float *column1, float *column2, float *column3, float *column4, float *column5)` in the `.c` file and tried to compile it using `gcc -Wall -g -o cern lex.yy.c` but I get errors – Thanos Dec 08 '13 at 09:57
  • `lex.yy.c:1166:17: warning: ‘yyunput’ defined but not used [-Wunused-function] lex.yy.c:1207:16: warning: ‘input’ defined but not used [-Wunused-function] /usr/lib/gcc/i686-linux-gnu/4.6/../../../i386-linux-gnu/crt1.o: In function `_start': (.text+0x18): undefined reference to `main' collect2: ld returned 1 exit status` – Thanos Dec 08 '13 at 10:01
  • to use make you don't have to use .l extension. Try the command: make cern About errors on using lex.yy.c: you need to have main somewhere. If you want to use the default main provided by flex, you don't have to rename main. You need to rename main (or clone it) only in the case that you want to use lex.yy.c as a library linked to your .c program (which must provide main). – HAL9000 Dec 08 '13 at 10:04
  • Now I see... I used `make cern` and executable really showed up, but when I type `./cern` it doesn't do anything at all.. I also tried `./cern data.txt` where data.txt is the file to be processed but again nothing happens... – Thanos Dec 08 '13 at 10:26
  • did you change "/home/user/example.dat" with your actual file location? Output files will be put in "/home/user/column%d.txt", change it to your home location. Also add some printf to main and try some debug. – HAL9000 Dec 08 '13 at 10:31
  • Yes, actually I have. I replaced `/home/user/column%d.txt` with `~/Desktop/column%d.txt` and `/home/user/example.dat` with `~/Desktop/data.txt` in the `cern.l` file and I get nothing at all. It seems that `./cern` keeps running forever... – Thanos Dec 08 '13 at 10:39
  • As far as I know from theory, it is not possible to have an endless loop with a Flex scanner, and all my for loops are safe. You should add some printf calls to debug. Anyway, I tested it on my machine and it works correctly. – HAL9000 Dec 08 '13 at 11:05
  • Thank you so much for your help and time! Where to add the printf? And how to debug it? As you understand I am not familiar with these stuff... – Thanos Dec 08 '13 at 21:10
  • Before data = atof(yytext); you can add a printf. That line is executed every time a number is found. Put a printf before yylex(); to know whether the scanner is going to start. Put another printf before BEGIN(_END_); to know when the scanner finishes. Put in as many printf calls as you can and try to understand what happens. Here you can find a tutorial for beginners http://alumni.cs.ucr.edu/~lgao/teaching/flex.html – HAL9000 Dec 08 '13 at 21:15
  • Thank you for the link;it's quite useful!!! What should `printf`, print? Just add `printf();`? – Thanos Dec 08 '13 at 21:19
  • Then, maybe you should start from here http://en.wikipedia.org/wiki/Hello_world_program#History – HAL9000 Dec 08 '13 at 21:31
  • You really mean to just print Hello World, in every step of the code? – Thanos Dec 08 '13 at 21:42
  • Print anything useful. Print "here", print some variable. It's called printf debugging http://stackoverflow.com/questions/189562/what-is-the-proper-name-for-doing-debugging-by-adding-print-statements – HAL9000 Dec 08 '13 at 22:02
It's not quite clear from your description and the example input, but I assume that the data columns are in lines 25 to 51. It looks as if the row of hyphens just above that indicates the columns. The first column starts with the first character of the line. (There is a similar table above that with four columns and one row, which is indented, which should probably be skipped.) The data rows are terminated by another row of hyphens.

So the basic algorithm is: Read everything up to the row of hyphens, store the column widths and starting points for each field, then read the subsequent data, cut out the columns you want using the information from the hyphens, and stop reading when you encounter the next row of hyphens.

That's probably something a script can do easily for you. The standalone C program below does that, too. You can call it from the command line like this:

./colcut data.txt 3 5 1

to print out columns 3, 5 and 1 (natural count, not zero-based) of the file "data.txt". Error handling is probably lacking - it doesn't check whether the columns are long enough, for example - but it looks serviceable:

#include <stdlib.h>
#include <stdio.h>



#define die(x) do {                             \
        fprintf(stderr, "Fatal: %s\n", x);      \
        exit(1);                                \
    } while (0)

#define MAX 10
#define MAXLEN 500

typedef struct {            /* Text slice into a char buffer */
    const char *str;        /* start pointer */
    int len;                /* slice length */
} Slice;

int main(int argc, char *argv[])
{
    FILE *f;
    int index[MAX];         /* column index */
    int nindex = 0;         /* number of columns to write */
    Slice cols[MAX];        /* column substrings */
    int context = 0;        /* Are we scanning columns? */
    int i;

    if (argc < 3) die("Usage: colcut file columns ...");

    for (i = 2; i < argc; i++) {
        int n = atoi(argv[i]);

        if (n < 1 || n > MAX) die("Illegal index");
        index[nindex++] = n - 1;
    }

    f = fopen(argv[1], "r");
    if (f == NULL) die("Could not open file.");

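    for (;;) {
        /* static: the slices in cols[] point into this buffer, so it
           must stay valid from one loop iteration to the next */
        static char line[MAXLEN];

        if (fgets(line, MAXLEN, f) == NULL) break;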

        if (context) {
            if (line[0] == '-') break;
            for (i = 0; i < nindex; i++) {
                int j = index[i];

                printf("    %.*s", cols[j].len, cols[j].str);
            }
            putchar('\n');
        }

        if (line[0] == '-') {
            const char *p = line;
            int n = 0;

            while (*p == '-' || *p == ' ') {
                cols[n].str = p;
                while (*p == '-') p++;
                cols[n].len = p- cols[n].str;
                while (*p == ' ') p++;
                if (++n == MAX) break;
            }

            for (i = 0; i < nindex; i++) {
                if (index[i] >= n) die("Columns index out of range");
            }
            context = 1;
        }
    } 
    fclose(f);   

    return 0;
}
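
The slices printed by this program are plain text, so a slice such as `  12.50 um` still carries its unit. If only the number is wanted, one option is to convert the start of each slice with strtod, which skips leading blanks and stops at the first character that cannot be part of a number. This is a minimal sketch; `slice_to_double` is a hypothetical helper that is not part of the program above, and the 64-byte buffer is an arbitrary assumption about the maximum column width.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert the leading number of a text slice
   (which is not NUL-terminated) to a double. Returns 1 on success,
   0 if the slice does not start with a number or is too long. */
int slice_to_double(const char *str, int len, double *out)
{
    char buf[64];   /* assumed upper bound for one column's width */
    char *end;

    assert(str != NULL && out != NULL);
    if (len < 0 || (size_t)len >= sizeof buf) return 0;

    memcpy(buf, str, (size_t)len);  /* copy so we can NUL-terminate */
    buf[len] = '\0';

    *out = strtod(buf, &end);       /* stops before a unit such as "um" */
    return end != buf;              /* end == buf means nothing was parsed */
}
```

In the printing loop, the slice `cols[j]` could be passed to this helper and the resulting value printed with `%g` instead of printing the raw text.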
M Oehm
  • Thank you very much for your answer! I tried to compile it(I am using `CodeBlocks` and `MinGW` on Vista) I get an error with status 1. – Thanos Dec 05 '13 at 21:30
  • Well, this is a stand-alone console application that has basically one routine, `main`, and depends on some standard library routines. Your second `main` must come from somewhere else. Try compiling without CodeBlocks, which probably has its own `main` or `WinMain`. I've compiled and tested it with `gcc -Wall -g -o colcut colcut.c` – M Oehm Dec 06 '13 at 07:31
  • I am running windows vista. I have downloaded `MinGW` but when I type `gcc -Wall -g -o colcut colcut.c` on a command prompt, I get that gcc cannot be recognised. I also tried `mingw -Wall -g -o colcut colcut.c` but again I get the same... – Thanos Dec 06 '13 at 08:35
  • Well, I can't compile the code for you. You asked for a solution in C or C++, so I thought that you already knew how to compile. – M Oehm Dec 06 '13 at 10:53
  • I compiled it (there was a mess with the paths and I have to run it inside the installation directory) and indeed a `colcut.exe` showed up! The thing is that when I double-click on it, nothing happens... I also used `colcut.exe data.txt 3 5 1` but there is an error `Fatal:Illegal index` – Thanos Dec 06 '13 at 11:14
  • Okay, `colcut.exe` is a console application: You have to use it from the DOS shell or MinGW shell or, if you want to double-click, write a batch file with the command. The illegal index error comes from the exe and tells you that you want to print column number 5 of a document with only four columns, for example. Have you looked at the code? The error message is there. – M Oehm Dec 06 '13 at 12:04
  • That is exactly the problem! The file that I used (http://pastebin.com/pgSjezdh) has 6 columns, so I don't see why the `if` statement is complaining! – Thanos Dec 06 '13 at 12:42
  • An idea would be that above the data, there is also a hyphen, and the first column is string("Si" for the linked file). So I used `colcut.exe data.txt 2` just to test it but I get the same error... – Thanos Dec 06 '13 at 12:51
  • So you are running the program on the example data you linked? That's what I used for testing and it worked. The indented row of hyphens (Atom name/Si) is ignored, because there must be a hyphen at the beginning of the line. But the program doesn't get that far: The "Illegal index" error occurs earlier, when the indices are sanity-checked. You could try to print `n` after the call to `atoi` or run the program in a debugger to find the error. – M Oehm Dec 06 '13 at 13:39
  • I followed the steps in your answer and indeed we are talking about the same file! `You could try to print n after the call to atoi or run the program in a debugger to find the error.` I can't say I quite understood... What `n` and what is `atoi`? Do you know a good debugger? – Thanos Dec 06 '13 at 13:57
  • I used `colcut.exe data.txt 2` and in the cmd I got the data values. So it's working, but something is wrong with the output file... Also if I use column 4 `colcut.exe data.txt 2 3 4` the `um` is present on the output. How to get rid of that? Thank you soooo much for your help!!! – Thanos Dec 06 '13 at 14:07
  • The error "Illegal index" occurs near the beginning of `main` inside a `for` loop when the column indices are read from the command line. There is a call `n = atoi(argv[i])`; print `n` after the call to check whether the numbers on the command line got converted properly. The error occurs if any of the arguments after the file are not numbers. (I told you that error checking is probably lacking, but I was under the impression that you were looking for a way to access the columns, not for a ready-made solution.) – M Oehm Dec 06 '13 at 14:11
  • Ok I understand about the errors! There is no problem! You helped me quite enough!!! Is there a way to print the columns in a seperate file and get rid of the units in data(columns 1, 4, 5, 6)? – Thanos Dec 06 '13 at 14:42
  • There are many ways, most of them don't involve C. For example if you have awk (which MinGW might have; Cygwin does), you can apply some very simple heuristics for the data at hand and do something like this: `awk '$6=="um" {x=1; if ($2=="keV") x=0.001; printf "%12s%12s%12s%12s\n", x*$1, $5, $7, $9}' data.txt` That's not straightforward, because awk is a language of its own, but it extracts columns and even converts keV to MeV. If you often have to work on raw data, learning a text processing language like awk or perl can pay dividends. – M Oehm Dec 06 '13 at 15:07
  • That really does the job, although I cannot understand fully the syntax... which means I have study to do!!! Is there a way to print the output in a new `.txt` file? – Thanos Dec 08 '13 at 09:05
  • If you want a solution just for your case, use the code and don't think about it. If you want to change the code, look into awk. You might also want to look into how the shell works: You can redirect the output as usual with `> result.txt`. – M Oehm Dec 08 '13 at 16:34
  • Thank you very much!!! `awk` looks like a nice tool, so I believe I should take a look at it. – Thanos Dec 08 '13 at 21:21
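
For readers who would rather stay in C, the logic of the awk one-liner in the comments can be sketched roughly as below. This is an illustration only: `parse_row` is a hypothetical helper, and the field layout (energy, energy unit, two further values, then three value/unit pairs) is inferred from the field numbers used in the one-liner, not checked against the data file. Unlike awk, the sketch also requires all four numeric fields to be present.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper mirroring the awk one-liner: accept a row only if
   its sixth field is "um", convert the energy to MeV when its unit is
   "keV", and store fields 1, 5, 7 and 9 in out[0..3].
   Returns 1 if the row was accepted, 0 otherwise. */
int parse_row(const char *line, double out[4])
{
    double energy, range, lon, lat;
    char e_unit[16], r_unit[16];
    int got;

    assert(line != NULL && out != NULL);

    /* %*lf and %*s read a field and discard it */
    got = sscanf(line, "%lf %15s %*lf %*lf %lf %15s %lf %*s %lf",
                 &energy, e_unit, &range, r_unit, &lon, &lat);
    if (got != 6 || strcmp(r_unit, "um") != 0) return 0;

    out[0] = (strcmp(e_unit, "keV") == 0) ? energy * 0.001 : energy;
    out[1] = range;
    out[2] = lon;
    out[3] = lat;
    return 1;
}
```

Header lines and rows of hyphens do not start with a number, so the function returns 0 for them and they are skipped, just as rows with a range unit other than `um` are.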