
I have a file of about 44 GB, and I want to add some content before its first line. I've tried sed, like this:

sed -i '1iSome content' /home/always/test.dat

However, it takes about 100 minutes. Is there a faster way? Any approach is fine: Java, C, Linux tools...

Actual scenario: I want to import a CSV into a neo4j database. My steps are:

  1. Export a CSV from Hive to HDFS (I can't use the Hive table header because it is not what neo4j wants).
  2. Download the CSV file to the local filesystem.
  3. Add a header to the CSV file (neo4j-admin import needs it).
Always

3 Answers


There is no way to insert an arbitrary amount of text at the start of a file that doesn't involve rewriting the entire file. This applies no matter what language or tool you use.

You might get a speedup by using something other than sed to do this¹, but the bottleneck is going to be disk / file system IO.

To get better performance:

  • treat the data as bytes and copy with a large buffer using the read(2) and write(2) syscalls, or
  • use the sendfile(2) syscall so that the data doesn't need to be copied via a user-space buffer (sketched below), or
  • if the data being inserted is (or can be padded to) an exact multiple of the file system block size, and the file system supports fallocate(2), then you can use that to insert the data without copying².

C is probably the best language for coding this.
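
For illustration, here is a minimal sketch of the sendfile(2) variant; the file names and the header string are placeholders, not anything from the question. It writes the header into a new file and then lets the kernel copy the old file's contents across without going through a user-space buffer. The 44 GB still gets rewritten, but with much less per-byte work than sed does:

/* Rough sketch, not production code. On Linux, sendfile(2) can write to
 * a regular file since kernel 2.6.33. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/sendfile.h>

int main(void) {
    const char *header = "col1,col2,col3\n";   /* placeholder header */
    int in = open("test.dat", O_RDONLY);
    if (in == -1) { perror("open input"); return 1; }
    int out = open("test.dat.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out == -1) { perror("open output"); return 1; }

    /* Write the header first ... */
    if (write(out, header, strlen(header)) == -1) { perror("write"); return 1; }

    /* ... then copy the original file kernel-to-kernel. A single call
     * transfers at most about 2 GiB, so loop until it is all copied. */
    struct stat st;
    if (fstat(in, &st)) { perror("fstat"); return 1; }
    off_t remaining = st.st_size;
    while (remaining > 0) {
        size_t chunk = remaining > 0x7ffff000 ? 0x7ffff000 : (size_t)remaining;
        ssize_t n = sendfile(out, in, NULL, chunk);
        if (n == -1) { perror("sendfile"); return 1; }
        if (n == 0) break;                     /* unexpected end of input */
        remaining -= n;
    }
    if (close(out)) { perror("close"); return 1; }
    /* rename("test.dat.new", "test.dat") would then complete the swap. */
    return 0;
}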

Alternatively, if you wanted to stick with existing command line utilities, using cat or dd with the appropriate flags would probably be faster than sed.


¹ - sed will most likely split the input into lines and then reassemble them in a user-space buffer. This is unnecessary.

² - The padding might consist of additional whitespace or "comment" lines ... assuming that the application reading the file can deal with these. If it can, see https://stackoverflow.com/a/59571893/139985 for example code to get you started.

Stephen C

Prepending to gigantic files is always going to be slow, no matter what language you do it in, because of how files are stored on the filesystem. However, there's one exception: if you want to insert a multiple of the block size, you can use fallocate to do it quickly, provided that the underlying filesystem supports it (such as ext4). For example, here's how you'd prepend 4096 x's to the_big_file:

#define _GNU_SOURCE

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* Some older system headers don't define this flag; its value is fixed. */
#ifndef FALLOC_FL_INSERT_RANGE
#define FALLOC_FL_INSERT_RANGE 0x20
#endif

int main(void) {
    int fd = open("the_big_file", O_WRONLY);
    if(fd == -1) {
        perror("open");
        return 1;
    }
    /* Shift the entire file contents right by 4096 bytes without copying
     * anything. Both the offset (0) and the length (4096) must be
     * multiples of the filesystem block size, or this fails with EINVAL. */
    if(fallocate(fd, FALLOC_FL_INSERT_RANGE, 0, 4096)) {
        perror("fallocate");
        return 1;
    }
    /* Fill the 4096-byte hole just inserted at the start of the file;
     * the file offset is still 0, since nothing has been written yet. */
    char buf[4096];
    memset(buf, 'x', sizeof(buf));
    ssize_t written = 0, result;
    do {
        result = write(fd, buf + written, sizeof(buf) - written);
        written += result;
    } while(result > 0);
    if(result < 0) {
        perror("write");
        return 1;
    }
    if(close(fd)) {
        perror("close");
        return 1;
    }
    return 0;
}
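
Note that this needs kernel and filesystem support: FALLOC_FL_INSERT_RANGE was added in Linux 4.1 and works only on filesystems that implement it, such as ext4 and XFS. On anything else the fallocate call simply fails with an error and the file is left untouched.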
  • Using `sendfile(2)` could be faster. – Stephen C Jan 03 '20 at 00:59
  • @StephenC Than `fallocate`, on a 44GB file? Not a chance. This code is basically instant no matter how big the file is. – Joseph Sible-Reinstate Monica Jan 03 '20 at 01:00
  • Oh ... ah. I see. – Stephen C Jan 03 '20 at 01:07
  • However, thank you very much. But I don't need to insert a multiple of the block size. The actual scenario is that I want to import data into a neo4j database, which needs a header, but the CSV file I exported doesn't include one. – Always Jan 03 '20 at 01:08
  • @Always If you can't live with a multiple of the block size, then you're stuck with it being slow. You should consider redoing your program so that you can append instead of prepending (appending is always fast), or accept some padding of the first line to make it fit a multiple of the block size. – Joseph Sible-Reinstate Monica Jan 03 '20 at 01:09
  • @Always In that case, this is [an XY problem](https://meta.stackexchange.com/q/66377/386992), and the right answer is that you should write the header first and then the data, rather than writing the data first and then trying to jam a header in front. – Joseph Sible-Reinstate Monica Jan 03 '20 at 01:16
  • Sorry that I can only accept one answer. The file is exported from HDFS, so I can't do that. I will add the scenario to the question later. – Always Jan 03 '20 at 01:28

Inserting data in front of a file without rewriting the whole file (which is what is slow here) is generally not possible in the operating systems typically available on modern PCs, due to the way file systems work.

As all you need is a header in front of the data file, you may be able to add it as part of the download step, if that is a simple wget or similar. On Linux this could look like:

(echo "header line 1"; echo "header line 2"; wget .... -O -) > big.csv

Or perhaps even pipe directly into the target program.

You will need to handle error situations carefully, for example by checking wget's exit status so that a failed download does not leave behind a truncated big.csv that looks complete.

Thorbjørn Ravn Andersen