
I have many files containing:

  • data: numbers that I have to use/manipulate, formatted in a specific way (described below),

  • rows that I need kept exactly as they are (the software uses these files as configuration).

The files are usually huge (many millions of rows) and can't be processed fast enough with bash. I have written a script that checks each line to see whether it is data, writing the data lines to another file (without any calculations), but it is very slow (a few thousand rows per second).

The data is formatted in a way like this:

text 
text
(
($data $data $data)
($data $data $data)
($data $data $data)
)
text
text
(
($data $data $data)
($data $data $data)
)
text
( text )
( text )
(text text)

Using $data, I have to produce another file containing the results of some operations on those values.

The portions of the file that contain numbers can be recognized by this opening sequence:

(
(

and by the matching sequence:

)
)

at the end.

I have previously written a C++ program that performs the operations I want, but only for files containing columns of numbers and nothing else. I don't know how to skip the text that I must not modify, nor how to handle the way the data is formatted.

Where should I look to solve this problem smartly?

What would be the best way to handle data files formatted in different ways and to do math on them? Maybe Python?

teo2070
  • Now that I re-read your question, you said you already have a script to separate out the data. :) Do you want a better filter script, or do you want something to do math on the data after filtering? As-is, neither of the answers @MvG and I have provided seems to quite answer what you're asking. – dannysauer May 02 '14 at 19:58

4 Answers


Are you sure that the shell isn't fast enough? Maybe your bash script just needs improving. :)

It appears that you want to print every line after a line with just a ( until you get to a closing ). So...

#!/usr/bin/ksh
# State machine: print=1 while we are inside a lone-( ... lone-) data block.
print=0
while read
do
    if [[ "$REPLY" == ')' ]]
    then
        print=0                 # a lone ')' closes the block
    elif [[ "$print" == 1 ]]
    then
        echo "${REPLY//[()]/}"  # inside a block: strip parens and print
    elif [[ "$REPLY" == '(' ]]
    then
        print=1                 # a lone '(' opens the block
    fi
done
exit 0

And, with your provided test data:

danny@machine:~$ ./test.sh < file
$data $data $data
$data $data $data
$data $data $data
$data $data $data
$data $data $data

I'll bet you'll find that to be roughly as fast as anything else you would write. If I were going to use this often, I'd be inclined to add several more error checks, but if your data is well-formed, this will work fine.

Alternatively, you could just use sed.

danny@machine:~$ sed -n '/^($/,/^)$/{/^[()]$/d;s/[()]//gp}' file
$data $data $data
$data $data $data
$data $data $data
$data $data $data
$data $data $data

performance note edit:

I was comparing the Python implementations below, so I thought I'd test these as well. The sed solution runs nearly identically to the fastest Python implementation on the same data: less than one second (0.9 seconds) to filter ~80K lines. The bash version takes 42.5 seconds to do the same. However, just replacing #!/bin/bash with #!/usr/bin/ksh above (ksh93, on Ubuntu 13.10), with no other changes to the script, reduces the runtime to 10.5 seconds. Still slower than Python or sed, but that's part of why I hate scripting in bash.

I also updated both solutions to remove the opening and closing parens, to be more consistent with the other answers.

dannysauer
  • If either of those are unclear (especially the sed "line noise"), leave a comment and I'll explain them better. ;) – dannysauer May 02 '14 at 19:55
  • Can't you remove the extra check for `b`(and `b` altogether) if you use `if`...`elif` statements? – kirbyfan64sos May 02 '14 at 21:17
  • @kirbyfan64sos: the one check for `$b` is to ensure that the '(' is alone on a line. It's done at the end so that it doesn't set `print=1` before the line that prints, ensuring that lone-paren line will be skipped. I read $a and $b in order to take advantage of $IFS pruning any whitespace after what should be the first character. If it was done with one variable, there'd have to be a regex applied to catch any trailing whitespace after the `(`. – dannysauer May 02 '14 at 21:45
  • Switching to `if/elif/elif` knocks about a second off of my 80K line test in ksh and knocks about 5 seconds off of the bash execution. Good call; updating the answer. :) However, moving to just $a makes no significant difference (though, moving to $REPLY rather than named parameters is another second faster) – dannysauer May 02 '14 at 22:19

Here is something that should perform well on huge data, using Python 3:

#!/usr/bin/python3

import mmap

fi = open('so23434490in.txt', 'rb')
m = mmap.mmap(fi.fileno(), 0, access=mmap.ACCESS_READ)
fo = open('so23434490out.txt', 'wb')
p2 = 0
while True:
    p1 = m.find(b'(\n(', p2)
    if p1 == -1:
        break
    p2 = m.find(b')\n)', p1)
    if p2 == -1:
        break # unmatched opening sequence!
    data = m[p1+3:p2]
    data = data.replace(b'(',b'').replace(b')',b'')

    # Now decide: either do some computation on that data in Python
    for line in data.split(b'\n'):
        cols = list(map(float, line.split(b' ')))
        # perform some operation on cols

    # Or simply write out the data to use it as input for your C++ code
    fo.write(data)
    fo.write(b'\n')
fo.close()
m.close()
fi.close()

This uses mmap to map the file into memory. Then you can access it easily without having to worry about reading it in. It is also very efficient, since it can avoid unnecessary copying (from the page cache to the application heap).
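
To make the computation branch concrete, here is a minimal sketch of what the # perform some operation on cols placeholder could become; the per-row sum and the name process_block are illustrative assumptions, not part of the original answer:

# Hypothetical example of the "some operation" placeholder:
# parse each row of a block and emit one result per row.
def process_block(data: bytes) -> bytes:
    rows = []
    for line in data.split(b'\n'):
        if not line.strip():
            continue  # skip blank lines at block boundaries
        cols = [float(x) for x in line.split()]
        rows.append(('%g' % sum(cols)).encode())  # the real math goes here
    return b'\n'.join(rows)

# e.g. process_block(b'1 2 3\n4 5 6') == b'6\n15'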

MvG
  • If you're just making one sequential pass over the file and the file isn't cached, I wouldn't think that mmap() would make a significant difference. Every bit from the file is still read in order from the disk. You mainly benefit from mmap when you're doing random access or otherwise seeking a lot - or if several programs with shared memory access the same file. In this situation, I'd expect regular ol' buffered I/O to be better. – dannysauer May 02 '14 at 19:53
  • @dannysauer: If the file is not in cache, it has to be read from disk, and all other costs will likely be negligible compared to that. If caches are hot, though, e.g. because the file just got written, or grepped, or whatever, then the benefit kicks in. I might be wrong, but I seem to recall that at least some system libs (glibc on Linux in particular) will implement other I/O on top of mmap, so you'd strip away one layer which at least can't be worse than buffered I/O. And I still believe that for things to end up in python arrays, you'd need one more memcopy which can be avoided. – MvG May 02 '14 at 19:59
  • I honestly don't know for sure how python's memory management works under the covers; I'm predominantly a perl guy, and so much happens behind perl's scenes that it wouldn't make a difference there (in this exact situation). Python is almost certainly quicker than perl here, but running an strace on both python variants would probably be interesting. There goes my afternoon. :) – dannysauer May 02 '14 at 20:13
  • @dannysauer: strace [says](https://gist.github.com/gagern/ce9132033eac9f0056c9) that there is no mmap syscall involved if you bulk read that file in Python. – MvG May 02 '14 at 20:21
  • @dannysauer: Seems I was wrong: glibc only uses mmap if you ask for it (by including `m` in the mode string). [This post](http://stackoverflow.com/a/6383253/1468366) has a nice discussion of the relative merits. – MvG May 02 '14 at 21:28
  • As it turns out, the mmap approach is, at least, better than looping over the file with a `for line in file` loop. I duplicated the sample data above 50K times in a file, resulting in an 8MB/850K line text file. Line-by-line is slower, and has 2K read syscalls (3812 total syscalls) vs. 78 read() calls in the mmap'd version (1824 total). Using readlines is even slower, probably due to reading the whole file into an array first. :) Details on [pastebin](http://pastebin.com/qMMe80Kc); a sketch of that line-by-line version follows below. – dannysauer May 02 '14 at 21:34
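
For reference, a minimal sketch of the plain line-by-line filter being compared in that benchmark: it is the same state machine as the shell answer, using ordinary buffered I/O instead of mmap (the file names simply mirror the mmap example and are illustrative):

#!/usr/bin/python3

# Buffered, line-by-line equivalent of the mmap filter, for comparison.
with open('so23434490in.txt', 'rb') as fi, open('so23434490out.txt', 'wb') as fo:
    in_block = False
    for line in fi:
        if line.startswith(b')'):
            in_block = False      # a lone ')' closes a data block
        elif in_block:
            fo.write(line.replace(b'(', b'').replace(b')', b''))
        elif line.rstrip() == b'(':
            in_block = True       # a lone '(' opens a data block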

Using C provides much better speed than bash/ksh or C++ (or Python, even though saying that stings). I created a text file of 18 million lines by duplicating the example text 1 million times. On my laptop, this C program processes the file in 1 second, while the Python version takes 5 seconds, and running the bash version under ksh (because it's faster than bash), with the edits mentioned in that answer's comments, takes 1 minute 20 seconds (i.e. 80 seconds). Note that this C program doesn't check for errors at all, except for a non-existent file. Here it is:

#include <string.h>
#include <stdio.h>

#define BUFSZ 1024
// I highly doubt there are lines longer than 1024 characters

int main()
{
    int is_area=0;
    char line[BUFSZ];
    FILE* f;
    if ((f = fopen("out.txt", "r")) != NULL)
    {
        while (fgets(line, BUFSZ, f))
        {
            if (line[0] == ')') is_area=0;
            else if (is_area) fputs(line, stdout); // fgets kept the newline, so fputs adds none
            else if (strcmp(line, "(\n") == 0) is_area=1;
        }
        fclose(f);
    }
    else
    {
        fprintf(stderr, "THE SKY IS FALLING!!!\n");
        return 1;
    }
    return 0;
}

If the fact that it's completely unsafe freaks you out, here's a C++ version, which took 2 seconds:

#include <iostream>
#include <fstream>
#include <string>

using namespace std;
// ^ FYI, the above is a bad idea, but I'm trying to preserve clarity

int main()
{
    ifstream in("out.txt");
    string line;
    bool is_area(false);
    while (getline(in, line))
    {
        if (line[0] == ')') is_area = false;
        else if (is_area) cout << line << '\n';
        else if (line == "(") is_area = true;
    }
    return 0;
}

EDIT: As MvG pointed out in the comments, I wasn't benchmarking the Python version fairly. It doesn't take 24 seconds as I originally stated, but 5 instead.

kirbyfan64sos
  • I guess you compared to the Python code from my answer, but what version of it? To be fair, you'd have to use [the one](http://stackoverflow.com/revisions/23435083/1) where I don't postprocess the data, in particular don't convert strings to floats. With that, I see 5s for Python vs. 1s for your C. Still quite a difference, but not as much as those 24s you mention. – MvG May 02 '14 at 21:52
  • @MvG: Sorry! I didn't realize that. It's surprising how much of a difference replacing text in a string can make! (I had removed the conversion to floats.) – kirbyfan64sos May 02 '14 at 21:59
  • Switch bash to ksh93; I added a note to my answer after testing and finding that bash sucks even more than I thought. :D – dannysauer May 02 '14 at 22:01
  • @dannysauer: Still takes over a minute on the 18 million line file. – kirbyfan64sos May 02 '14 at 22:16
  • Did you include the time it takes for the typical mid-level sysadmin who'd ask this kind of question to brush the dust off of a C programming book and compile the program, vs writing the shell version? ;) – dannysauer May 02 '14 at 22:29
  • @dannysauer: Well, the OP seems to know how to compile a program, since he made a C++ that (partly) worked. Although that is pretty true... :) – kirbyfan64sos May 02 '14 at 22:34
  • I guess that, just to defend ksh's honor some, Here's the [fastest ksh version](http://pastebin.com/jTSM5kSt) I could come up with. It's reasonably fast on my machine, but still ~10x slower than a "real" language. :) – dannysauer May 02 '14 at 23:43
0

I guess we need a perl solution, too.

#!/usr/bin/perl

# Same state machine as the shell version: $p is true inside a data block.
my $p = 0;
while (<STDIN>) {
    if (/^\)\s*$/) {
        $p = 0;
    }
    elsif ($p) {
        s/[()]//g;
        print;
    }
    elsif (/^\(\s*$/) {
        $p = 1;
    }
}

On my system, this runs slightly slower than the fastest Python implementation from above (while also doing the parenthesis removal), and about the same as

sed -n '/^($/,/^)$/{/^[()]$/d;s/[()]//gp}'
dannysauer