
I am working on a project that involves big data stored in .txt files. My program runs a little slowly, and I think one reason is that it parses the file inefficiently.

FILE SAMPLE:

X | Y | Weight
--------------

1  1  1
1  2  1
1  3  1
1  4  1
1  5  1
1  6  1
1  7  1
1  8  1
1  9  1
1  10  1

PARSER CODE:

def _parse(pathToFile):
    with open(pathToFile) as f:
        myList = []
        for line in f:
            s = line.split()
            x, y, w = [int(v) for v in s]
            obj = CoresetPoint(x, y, w)
            myList.append(obj)
    return myList

This function is invoked NumberOfRows/N times, as I only parse a small chunk of the data at a time and process it until no lines are left. The .txt file is several gigabytes.
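A rough sketch of how that chunked invocation is structured, with N and the per-chunk processing reduced to placeholders (my real code is more involved):

import itertools

def _parse_chunk(f, n):
    # Same parsing as above, but limited to the next n lines of an
    # already-open file object instead of the whole file.
    chunk = []
    for line in itertools.islice(f, n):
        x, y, w = (int(v) for v in line.split())
        chunk.append(CoresetPoint(x, y, w))
    return chunk

N = 100000  # placeholder chunk size
with open(pathToFile) as f:
    while True:
        points = _parse_chunk(f, N)
        if not points:
            break
        process(points)  # placeholder for the per-chunk processing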

I can obviously see that I iterate NumberOfLines times in the loop, which is a huge bottleneck. This leads me to my question:

Question: What is the right approach to parsing such a file, what would be the most efficient way to do so, and would organizing the data differently in the .txt file speed up the parser? If so, how should I organize the data inside the file?

Tony Tannous

1 Answer


In Python there is a library for this called Pandas. Import the data with Pandas in the following way:

import pandas as pd
# your sample data is whitespace-separated, so tell read_csv not to expect commas
df = pd.read_csv('<pathToFile>.txt', sep=r'\s+')

In case the file is too big to load into memory all at once, you could loop through parts of the data and load them one at a time. Here is a pretty good blog post that can help you do that.
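For example, something along these lines; the column names and chunk size are placeholders, and sep=r'\s+' assumes the file is whitespace-separated like your sample:

import pandas as pd

# read the file 1,000,000 rows at a time instead of all at once
reader = pd.read_csv('<pathToFile>.txt', sep=r'\s+', header=None,
                     names=['x', 'y', 'weight'], chunksize=1_000_000)

for chunk in reader:
    # each chunk is a regular DataFrame with up to 1,000,000 rows
    process(chunk)  # placeholder for your per-chunk processing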

lorenzori
  • I can't have the whole file in main memory as it's very large; won't this bring it into main memory? – Tony Tannous Jan 31 '17 at 10:00
  • Yes, this will load it into memory. How big is it? If you really need to, you could go for distributed tools such as Spark's RDDs, but that would take some time. What about sampling the data? Look at this question: http://stackoverflow.com/questions/22258491/read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame. You could also loop over parts of the data so as not to load it all into memory at once. – lorenzori Jan 31 '17 at 11:35
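A minimal sketch of the sampling idea from the linked question in the comment above, assuming the same whitespace-separated layout as earlier (the keep-probability and column names are placeholders):

import random
import pandas as pd

p = 0.01  # keep roughly 1% of the rows, chosen at random
df = pd.read_csv('<pathToFile>.txt', sep=r'\s+', header=None,
                 names=['x', 'y', 'weight'],
                 skiprows=lambda i: random.random() > p)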