
The title speaks for itself. This is a code excerpt using the spaCy NLP framework.

import spacy

nlp = spacy.load("en_core_web_sm")  # example model; any loaded pipeline works here
with open("text.txt") as sentences:
    docs = list(nlp.pipe(sentences.readlines()))

I tried using this package, but it didn't seem to support one-liners in the way that I would like.

The end goal is to be able to tell how long it will take to tokenize a very large amount of data, i.e. to get a reasonable ETA.

How can this be accomplished?
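
For illustration, here is roughly the kind of thing I am after, sketched with the tqdm package (just an assumption on my part that a generic progress bar over nlp.pipe would give a usable ETA; it also gives up the one-liner form above):

from tqdm import tqdm

with open("text.txt") as sentences:
    lines = sentences.readlines()

docs = []
# total=len(lines) lets tqdm report throughput and an estimated time remaining
# while nlp.pipe streams documents out lazily.
for doc in tqdm(nlp.pipe(lines), total=len(lines)):
    docs.append(doc)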

  • The way you do that is to time the process for a subset of your text, then scale that up based on how large your subset was. If you run 1/10,000th of your data, then multiply that time by 10,000. – Tim Roberts Jul 03 '22 at 02:43
  • @TimRoberts hi, thanks for your reply and edit. this approach assumes that all sub-inputs take the same time. is there *any* way to increase precision on this? – johnrabbit Jul 03 '22 at 03:37
  • A random subset is likely to be representative of the entire population. The only way to increase the precision is to run a larger subset. – Tim Roberts Jul 03 '22 at 03:50
  • There's a reason that every loading bar you've ever seen jumps around, calculating future work with limited known variables is a game of educated guesswork, not something that can actually be calculated with a high level of accuracy. – BeRT2me Jul 03 '22 at 05:26
  • I guess the heart of my question is, instead of timing one sub-operation and extrapolating 10 thousand of them, I wasn't sure if there was any way in Python to sort of scaffold into the subprocesses and capture the time it takes for each to run. Therefore, one would be able to use the package I linked earlier. It has a smoothing algorithm to deal with the jumps that you mentioned. @BeRT2me – johnrabbit Jul 03 '22 at 05:28
  • I didn't look super hard, but it looks like the author is just using `timeit` to sample and then extrapolating from there. Basically, they sample as they go and are able to quickly 'smooth' the algorithm by taking multiple samples as it goes. [code](https://github.com/rsalmei/alive-progress/blob/main/alive_progress/tools/sampling.py) – BeRT2me Jul 03 '22 at 05:49
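
To make the subset-timing suggestion from the comments concrete, here is a rough sketch (assuming the same text.txt and nlp pipeline as above; the 1/10,000 sample fraction and time.perf_counter are illustrative choices, not anything prescribed in the comments):

import random
import time

with open("text.txt") as sentences:
    lines = sentences.readlines()

# Time a small random sample (here roughly 1/10,000th of the lines).
sample = random.sample(lines, max(1, len(lines) // 10_000))

start = time.perf_counter()
_ = list(nlp.pipe(sample))
elapsed = time.perf_counter() - start

# Scale the sample time up to the full data set for a rough ETA.
eta_seconds = elapsed * (len(lines) / len(sample))
print(f"Estimated total time: ~{eta_seconds / 60:.1f} minutes")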

0 Answers