
Given the following CSV file:

01;blue;brown;black
02;glass;rock;paper
03;pigeon;squirel;shark

My goal is to replace the (unique) line containing '02' in the first position.

I wrote this piece of code:

import csv

with open("csv", 'r', newline='', encoding='utf-8') as csvfile, open('csvout', 'w', newline='', encoding='utf-8') as out:
    reader = csv.reader(csvfile, delimiter=';')
    writer = csv.writer(out, delimiter=';')
    for row in reader:
        if row[0] != '02':
            writer.writerow(row)
        else:
            writer.writerow(['02', 'A', 'B', 'C'])

But re-writing the whole CSV into another file doesn't seem to be the most efficient way to proceed, especially for large files:

  1. Once the match is found, we continue to read till the end.
  2. We have to re-write every line one by one.
  3. Writing to a second file isn't very practical, nor is it storage-efficient.
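Regarding the third point, a common way to make the two-file approach practical is to write to a temporary file in the same directory and then atomically swap it over the original with os.replace(). A minimal sketch (the helper name replace_row and the file name are illustrative, not from the original code):

```python
import csv
import os
import tempfile

def replace_row(path, key, new_row):
    """Rewrite `path`, swapping the row whose first field equals `key`."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the original so os.replace stays
    # on the same filesystem (required for an atomic rename).
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix='.tmp')
    with open(path, 'r', newline='', encoding='utf-8') as src, \
         os.fdopen(fd, 'w', newline='', encoding='utf-8') as dst:
        reader = csv.reader(src, delimiter=';')
        writer = csv.writer(dst, delimiter=';')
        for row in reader:
            writer.writerow(new_row if row and row[0] == key else row)
    os.replace(tmp_path, path)  # atomic on both POSIX and Windows
```

This still reads and writes every line, but no second file is left behind, and a crash mid-write never corrupts the original.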

I wrote a second piece of code which seems to address these problems:

with open("csv", 'r+', newline='', encoding='utf-8') as csvfile:
    content = csvfile.readlines()
    for index, row in enumerate(content):
        if row.split(';')[2] == 'rock':   # match on the 3rd field
            # Replace the matching line in place; without the break,
            # a missing match would leave the index undefined.
            content[index] = '02;A;B;C\n'
            break
    csvfile.seek(0)
    csvfile.truncate()      # Erase the old content
    csvfile.write("".join(content))

Do you agree that the second solution is more efficient? Do you have any improvements, or a better way to proceed?

EDIT: The number of characters in the line can vary.

EDIT 2: Apparently I'm obliged to read and rewrite everything if I don't want to use padding. A possible alternative would be a database-like solution; I will consider it for the future.

If I had to choose between these two solutions, which one would be the best performance-wise?

SecT0uch
  • Is the CSV file sorted by that first column? If so, you can [binary-search and overwrite](https://stackoverflow.com/a/54884704/1639625) that specific line in O(logn), but you have to overwrite it with the same number of characters. – tobias_k Feb 28 '19 at 12:17
  • BTW, in your second code you are still reading and writing the entire file. – tobias_k Feb 28 '19 at 12:20
  • My CSV is actually sorted by the 4th column, starting with 00000001 from the second line (the first line holds the column descriptions). The number of characters can vary. – SecT0uch Feb 28 '19 at 13:24
  • Well, you could still do a linear search for that line, backtrack to the start of the line with `seek` and then overwrite just this one line, padding it with e.g. space to get to the old length, unless the old line is too short, then you have to overwrite all the following lines in the file, too (until you find another line that has been padded before and you can remove that padding). – tobias_k Feb 28 '19 at 13:29
  • That would indeed answer the problem, but I would like to avoid using padding. – SecT0uch Feb 28 '19 at 13:43
  • About your follow-up question: A proper data base might indeed be better. Certainly faster for retrieving/editing/updating/deleting individual entries. – tobias_k Feb 28 '19 at 14:30

1 Answer


As the number of characters in the line may vary, I either have to read/write the whole file or, as @tobias_k said, use seek() to go back to the beginning of the line and:

  • If the new line is shorter, write just that line and pad with spaces;
  • If it's the same length, write just that line;
  • If it's longer, re-write that line and all the following ones.
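The first two cases above can be sketched with seek(); overwrite_line is a hypothetical helper, and it simply reports failure in the third case (a longer replacement), where the rest of the file would have to be shifted:

```python
def overwrite_line(path, key, new_line):
    """Overwrite in place the line whose first ';'-field equals `key`,
    padding with spaces up to the old length. Returns False when the
    replacement is longer than the original line (no rewrite done)."""
    with open(path, 'r+', newline='', encoding='utf-8') as f:
        pos = f.tell()                       # start-of-line position
        for line in iter(f.readline, ''):    # allows tell() mid-read
            if line.split(';')[0] == key:
                old_len = len(line.rstrip('\n'))
                new = new_line.rstrip('\n')
                if len(new) <= old_len:
                    f.seek(pos)
                    # Pad with spaces so the bytes after this line,
                    # including its newline, are left untouched.
                    f.write(new.ljust(old_len))
                    return True
                return False                 # longer: needs full rewrite
            pos = f.tell()
    return False
```

Only one line is touched on disk, at the cost of trailing spaces that later readers must strip.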

I want to avoid using padding, so I used time.perf_counter() to measure the execution time of both pieces of code, and the second solution turns out to be almost 2× faster (CSV of 10,000 lines, match at the 6,000th).
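For reference, a minimal harness of the kind used for that measurement (the function being timed is whatever wraps each approach; nothing here is specific to the original code):

```python
import time

def time_once(func, *args):
    """Run func(*args) once and return the elapsed wall-clock seconds,
    using the highest-resolution clock available."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start
```

For stable numbers on small inputs, the standard-library timeit module, which repeats the call many times, is usually preferable.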

One alternative would be to migrate to a relational database.
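As a sketch of what that migration buys, here is the same update with the standard-library sqlite3 (the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # a file path would persist the data
conn.execute('CREATE TABLE rows (id TEXT PRIMARY KEY, a TEXT, b TEXT, c TEXT)')
conn.executemany('INSERT INTO rows VALUES (?, ?, ?, ?)', [
    ('01', 'blue', 'brown', 'black'),
    ('02', 'glass', 'rock', 'paper'),
    ('03', 'pigeon', 'squirel', 'shark'),
])
# The primary-key index finds the row without scanning or rewriting
# the rest of the data, regardless of the new values' lengths.
conn.execute("UPDATE rows SET a = 'A', b = 'B', c = 'C' WHERE id = '02'")
conn.commit()
```

Variable-length updates stop being a problem entirely, since the storage layer handles record sizes itself.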
