2

If I have a CSV file, is it faster to keep the file as plain text or to convert it to some other format? (for searching)

In terms of searching a CSV file, what is the fastest method of retrieving a particular row (by key)? I'm not referring to sorting the file, sorry; what I mean is looking up an arbitrary key in the file.

Some updates:

  1. the file will be read-only
  2. the file can be read and kept in memory
loyalflow
  • 14,275
  • 27
  • 107
  • 168
  • @HunterMcMillen Say 1 million rows. I edited the question; I'm looking for lookup algorithms, not sorting. – loyalflow Jun 04 '12 at 18:31
  • You can index the file by the key. Look at [**this answer**](https://stackoverflow.com/questions/50899880/how-can-i-split-csv-files-in-python/65189049#65189049) for clues. –  Dec 10 '20 at 13:22

4 Answers

2

There are several things to consider for this:

  • What kind of data do you store? Does it actually make sense to convert it to a binary format? Will a binary format take up less space (the time it takes to read the file depends on its size)?

  • Do you run multiple queries against the same file while the system is running, or do you have to load the file each time someone does a query?

  • Do you need to efficiently transfer the file between different systems?

All these factors are very important for a decision. The common case is that you only need to load the file once and then do many queries. In that case it hardly matters what format you store the data in, because it will be stored in memory afterwards anyway. Spend more time thinking about good data structures to handle the queries.
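
For example, a minimal sketch in Python of the "parse once, query many times" case (assuming the key is in the first column, keys are unique, and the whole file fits in memory; `data.csv` is a placeholder name):

```python
import csv

# One-time parse: build a dict keyed by the first column.
# Assumes the key column is unique and the file fits in memory.
with open("data.csv", newline="") as f:
    index = {row[0]: row for row in csv.reader(f)}

# Every subsequent lookup is a hash lookup, independent of the on-disk format.
row = index.get("some_key")
```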

Another common case is that you cannot keep the main application running, and hence you cannot keep the file in memory. In that case, get rid of the file and use a database. Any database you can use will most likely be faster than anything you could come up with. However, it is not easy to transfer a database between systems.

Most likely, though, the file format will not be a real issue to consider. I've read quite a few very long CSV files, and most often the time it took to read the file was negligible compared to what I needed to do with the data afterwards.

LiKao
  • 10,408
  • 6
  • 53
  • 91
  • +1 for all the considerations to weigh before deciding, even if this looks more like a Programmers.SE answer. – Vitaly Olegovitch Jun 04 '12 at 21:07
  • @user1361315: If you can keep the file in memory, do not care at all about the format. Parse the file into a data structure optimized for your type of queries and use that. Analyzing the type of queries you expect and implementing the data structures will be enough work, so spend your time optimizing there. Your question makes it seem like you are planning to keep the file in memory as a stream of bytes as is (i.e. without parsing it). That would be a very stupid idea, since parsing will make your life easier here. After parsing, the in-memory structure will be the same no matter the file format. – LiKao Jun 04 '12 at 22:30
  • True, but what if the file size is too large for memory? – loyalflow Jun 05 '12 at 14:13
  • In that case, put it into a database. What system are you running on that cannot handle 1 million rows? That should only take a few MB, depending on the length of the rows. You never need to hold the complete file in memory while reading it. Parse it into the data structure and use that; the data structure will be much smaller. If the data is too big even to be kept in any kind of data structure, then you need something different altogether, e.g. a file-based data structure like a B-tree, or a database (basically a B-tree provided with some additional fancy stuff). – LiKao Jun 06 '12 at 11:13
  • 1
    @user1361315: Also note that if the data is too large to fit into memory, then reading a CSV file each time and searching it linearly will be a very bad choice, so you had better pick a different structure that allows more direct access. I am very confused, though, by the remark that the file will not fit completely in memory, because it contradicts the update "the file can be read and kept in memory" which you added to the original question. Either it can fit into memory and then it can be kept there, or it is too big. It can't be both. – LiKao Jun 06 '12 at 11:18
1

If you have a lot of data and this is for production, then use Apache Lucene.

If it's a small dataset, or this is about learning, then read up on suffix trees and tries; a quick sketch of a trie is below.
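
A minimal trie sketch in Python, for illustration only (exact-key lookup, no suffix links or compression; the class and function names are made up):

```python
class TrieNode:
    """Minimal trie node for exact key -> row lookups."""
    def __init__(self):
        self.children = {}
        self.row = None  # payload stored at the end of a complete key

def trie_insert(root, key, row):
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.row = row

def trie_lookup(root, key):
    node = root
    for ch in key:
        node = node.children.get(ch)
        if node is None:
            return None  # key not present
    return node.row
```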

Suraj Chandran
  • 24,433
  • 12
  • 63
  • 94
1

"Convert" it (i.e. import it) into a database table (or preferably normalised tables) with indexes on the searchable columns and a primary key on the column that has the highest cardinality - no need to re-invent the wheel... you'll save yourself a lot of issues - transaction management, concurrency.... really - if it will be in production, the chance that you will want to keep it in csv format is slim to zero.

Nathan
  • 6,095
  • 2
  • 35
  • 61
0

If the file is too large to keep in memory, then just keep the keys in memory. Some number of rows can also be kept in memory, with the least-recently-accessed rows paged out as additional rows are needed. Use fseek (directed by the keys) to find the row in the file itself, then load that row into memory in case other entries on that row are needed.
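
A rough sketch of that idea in Python (assumes the key is the first column and that keys contain no quoted commas; `data.csv` is a placeholder name, and the row cache is left out for brevity):

```python
# Build a key -> byte-offset index in one pass; only the index stays in memory.
offsets = {}
with open("data.csv", "rb") as f:
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        key = line.split(b",", 1)[0]  # naive split: no quoted commas in the key
        offsets[key.decode()] = pos

def get_row(key):
    """Seek straight to the stored offset and read just that one row."""
    pos = offsets.get(key)
    if pos is None:
        return None
    with open("data.csv", "rb") as f:
        f.seek(pos)
        return f.readline().decode().rstrip("\n").split(",")
```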

Jiminion
  • 5,080
  • 1
  • 31
  • 54