I want to randomly sample a data frame in pandas without reading the entire CSV. Is this possible?

There's an nrows argument, but I think it returns the first n rows, so the result isn't actually random.

I don't want to use .sample() because that means I have to read the entire csv first.

My code

import pandas as pd

sample_size = 10
df = pd.read_csv(input_data, nrows=sample_size)  # reads only the first sample_size rows
wjandrea
Eisen
  • Why not read the entire CSV? Is it too big? – wjandrea Jul 03 '23 at 17:43
  • It's large, so I don't want to spend time reading the entire thing; I'd rather just sample it. – Eisen Jul 03 '23 at 17:48
  • This could be helpful: [How can I partially read a huge CSV file? \[pandas\]](/q/29334463/4518341) – wjandrea Jul 03 '23 at 17:52
  • This thread describes a similar problem and also shows different solutions using the skiprows argument in pandas: https://stackoverflow.com/questions/22258491/read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame . – Timo Jul 03 '23 at 17:55
  • I believe both of the previously suggested solutions will in reality still read the entire file. In both cases, the entire file is read but only the selected rows are stored in memory. While I think both approaches will reduce memory usage, neither addresses the OP's question about not reading the entire file. – itprorh66 Jul 03 '23 at 18:34
  • @itprorh66 Rows can span multiple lines due to quoting, so reading the entire file is unavoidable -- at least based on the info provided, but maybe OP's CSV doesn't use quoting. – wjandrea Jul 03 '23 at 19:02
  • @wjandrea While what you say about CSV files is true, this fact is not evident from the OP's question. However, regardless of whether rows span multiple lines, neither of the two prior suggestions results in anything but reading the entire file, which was the entire point of my comment. – itprorh66 Jul 04 '23 at 13:39
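
For reference, the skiprows approach mentioned in the comments could look roughly like the sketch below. This is only an illustration, not an answer from the thread: sample_csv, fraction, and the in-memory csv_text are hypothetical names made up for the example. As the comments point out, pandas still has to scan the whole file this way; only the kept rows end up in memory, and the number of rows returned is approximate rather than exact.

```python
import io
import random

import pandas as pd


def sample_csv(source, fraction, seed=None):
    """Keep roughly `fraction` of the data rows (hypothetical helper)."""
    rng = random.Random(seed)
    # skiprows accepts a callable that receives the row index and returns
    # True when that row should be skipped. Index 0 is the header line
    # here, so it is always kept.
    return pd.read_csv(
        source,
        skiprows=lambda i: i > 0 and rng.random() > fraction,
    )


# Small in-memory stand-in for a large CSV file (made-up data).
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))
df = sample_csv(io.StringIO(csv_text), fraction=0.1, seed=0)
```

Passing a file path instead of the StringIO object should work the same way. If an exact sample size is needed, one option would be to call df.sample(n=sample_size) on this already reduced frame.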

0 Answers