This may be too complex or unusual a question, but here is my situation: I have a text file, “data.txt”, that contains millions upon millions of lines. My goal is to print the first word of every line. Normally this is going to take hours, since I can’t load the file into RAM.

I have tried using a pool, but that doesn’t seem to work. Creating separate processes doesn’t seem to work either, since you can’t use .readlines() to give each process a range of lines to work with.

I’m completely lost, so any suggestion is very welcome.

  • How do you plan to improve the situation by creating multiple threads which all read the same file? Your bottlenecks are RAM size (so don't read all at once) and file reading speed (you cannot improve that one using threads). – zvone Oct 21 '20 at 00:08
  • Just don't read the entire file at once, but read it one line at a time, process the line and move on to the next. You can use memory-mapped files if you need more flexibility, but you probably don't really. – Grismar Oct 21 '20 at 00:12
  • _What my goal is to print the first word of every line._ Is that all you need to do? _But normally this is gonna hours since I can’t load it into my ram._ Have you done any benchmarking? – AMC Oct 21 '20 at 00:25
  • You cannot easily parallelize reading a file line-wise. The beginning of a line is the end of the previous line. You cannot find the next line until you fully read the previous one. The job is inherently sequential. – DYZ Oct 21 '20 at 00:27
  • Open the file, read the first line, print the first word, move on to the next line. Maybe this will solve it easily. – Filip Oct 21 '20 at 00:28
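
The line-at-a-time approach suggested in the comments could look roughly like the sketch below. It assumes the file is UTF-8 text and that a "word" means a run of non-whitespace characters; the function names `first_words` and `print_first_words` are made up for illustration. Iterating over the file object reads it lazily through a buffer, so memory use should stay small regardless of how many lines the file has.

```python
from typing import Iterable, Iterator


def first_words(lines: Iterable[str]) -> Iterator[str]:
    """Yield the first whitespace-separated word of each non-blank line."""
    for line in lines:
        # maxsplit=1 stops after the first word instead of tokenizing the whole line.
        parts = line.split(None, 1)
        if parts:  # blank or whitespace-only lines have no first word; skip them
            yield parts[0]


def print_first_words(path: str) -> None:
    """Stream the file at `path` and print the first word of every line."""
    # Assumption: UTF-8 text; undecodable bytes are replaced rather than raising.
    with open(path, encoding="utf-8", errors="replace") as f:
        # A file object is an iterator over its lines, so only a small
        # buffer is held in memory at any time (unlike f.readlines()).
        for word in first_words(f):
            print(word)
```

With this in place, `print_first_words("data.txt")` would walk the file once from start to end. If it is still slow, it is worth timing it with the output redirected to a file, since printing millions of lines to a terminal may cost more than reading them.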

0 Answers