I am trying to run Python Twarc's hydrate command on a very large file of 2,339,076 records, but it keeps freezing. The same script works fine on a smaller data set. Does Twarc have a maximum number of rows it can process? If so, what is it? Do I need to split my data into smaller files?

I have tried the terminal command:

twarc2 hydrate 2020-03-22_clean-dataset_csv.csv > hydrated.jsonl

I have tried it on a smaller file and it works fine.

I have searched for whether there is a limit on the number of rows Twarc can process, but I can't find an answer.

frogger
  • Maybe it's consuming all available memory? CSV is line-based, so it can easily be processed in batches, but JSON is not. – Mike Szyndel Apr 05 '23 at 14:30
  • OK, yes, that makes sense. Is there any way I can fix it, do you think? Or should I split my data into smaller files? – frogger Apr 05 '23 at 14:32
  • Check your RAM usage. If all RAM is full and the system has to use swap, it can become really slow. – Caridorc Apr 05 '23 at 15:03

1 Answer

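You can use the built-in `split` command (the `-n l/10` chunk syntax below is from GNU coreutils and may not be accepted by other implementations):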

split -n l/10 -d 2020-03-22_clean-dataset_csv.csv subset_

This will create 10 files with names like subset_00, subset_01, etc., each containing approximately one-tenth of the original data.
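If `split` on your system does not accept the `l/10` chunk syntax, the same line-based split can be sketched in Python. This is only a minimal illustration: the function name `split_by_lines` and its parameters are made up for this example, and it assumes the input holds one record per line. It splits by a fixed number of lines per file rather than into a fixed number of files.

```python
from itertools import islice
from pathlib import Path


def split_by_lines(src, lines_per_chunk, prefix="subset_"):
    """Split src into files named prefix00, prefix01, ... next to src.

    Hypothetical helper for illustration. Lines are never broken across
    files, and only one chunk is held in memory at a time.
    """
    src = Path(src)
    written = []
    # newline="" keeps the original line endings untouched.
    with src.open("r", encoding="utf-8", newline="") as infile:
        index = 0
        while True:
            # Read at most lines_per_chunk lines from the current position.
            chunk = list(islice(infile, lines_per_chunk))
            if not chunk:
                break
            out_path = src.with_name(f"{prefix}{index:02d}")
            with out_path.open("w", encoding="utf-8", newline="") as outfile:
                outfile.writelines(chunk)
            written.append(out_path)
            index += 1
    return written
```

For example, `split_by_lines("2020-03-22_clean-dataset_csv.csv", 250_000)` should produce about ten files for an input of roughly 2.3 million lines, named `subset_00`, `subset_01`, and so on.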

You can then run Twarc hydrate on each subset separately, like this:

twarc2 hydrate subset_00 > hydrated_00.jsonl

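You can then read the resulting .jsonl files one by one, or merge them: JSONL stores one JSON document per line, so concatenating the files (for example with `cat`) gives a valid combined file, provided each file ends with a newline. (Warning: untested, as I cannot install twarc2.)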
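One possible way to do the merge is sketched below in Python. The helper name `merge_jsonl` is invented for this example, and the `hydrated_00.jsonl`-style file names are assumed from the command above. It copies the files line by line, so memory use stays low, and it skips blank lines and guards against a part that lacks a trailing newline.

```python
from pathlib import Path


def merge_jsonl(parts, dest):
    """Concatenate JSONL files into dest and return the number of lines written.

    Hypothetical helper for illustration. Each non-empty input line is
    written as exactly one output line; the JSON itself is not parsed.
    """
    count = 0
    with open(dest, "w", encoding="utf-8") as out:
        for part in parts:
            with open(part, "r", encoding="utf-8") as src:
                for line in src:
                    line = line.rstrip("\r\n")
                    if line:  # skip blank lines
                        out.write(line + "\n")
                        count += 1
    return count
```

A call such as `merge_jsonl(sorted(Path(".").glob("hydrated_*.jsonl")), "hydrated.jsonl")` would combine the per-subset outputs in name order. The destination name has no underscore after `hydrated`, so it is not picked up by the glob on a rerun.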

Caridorc
  • I have tried your suggestion, but using `split l/10` gives the error `illegal number of chunks`. If I just use the default, it gives the error `too many files`. I am using macOS 13.2.1. – frogger Apr 06 '23 at 16:49
  • It works for a test .csv file on my computer. You can try some of these other options. CHUNKS may be: `N` (split into N files based on size of input); `K/N` (output Kth of N to stdout); `l/N` (split into N files without splitting lines/records); `l/K/N` (output Kth of N to stdout without splitting lines/records); `r/N` (like `l` but use round robin distribution); `r/K/N` (likewise but only output Kth of N to stdout). – Caridorc Apr 07 '23 at 11:20