0

I have a file with almost 100,000 lines. I want to run a cleaning process on it (lowercasing, removing stopwords, etc.). However, it takes a long time.

For example, for 10,000 lines the script needs 15 minutes, so for the whole file I expected about 150 minutes. However, it takes 5 hours.

At the start, the file is read like this:

fileinput = open('tweets.txt', 'r')

lines = fileinput.read().lower()  # lowercases, but this loads the whole file into memory

for line in fileinput:
    lines = line.lower()

Question: Is there a way to read the first 10,000 lines, run the cleaning process on them, then read the next block of lines, and so on?

Joe Kalvos
    This might be helpful: http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python – Alex L Jan 04 '13 at 09:55
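
A minimal sketch of the block-of-lines reading the question asks about, using itertools.islice (in the spirit of the linked lazy-reading answer); the 10,000-line block size comes from the question, and clean_block is a hypothetical placeholder for the actual cleaning step:

from itertools import islice

def clean_block(lines):
    # hypothetical placeholder: lowercase each line; stopword removal would go here
    return [line.lower() for line in lines]

with open('tweets.txt', 'r') as fileinput:
    while True:
        block = list(islice(fileinput, 10000))  # read the next block of up to 10,000 lines
        if not block:
            break
        cleaned = clean_block(block)
        # ... write out or otherwise use the cleaned block ...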

3 Answers

2

I would highly suggest operating line-by-line instead of reading in the entire file all at once (in other words, don't use .read()).

with open('tweets.txt', 'r') as fileinput:
    for line in fileinput:
        line = line.lower()
        # ... do something with line ...
        # (for example, write the line to a new file, or print it)

This will automatically take advantage of Python's built-in buffering capabilities.
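
For the cleaning the question describes (lowercasing plus stopword removal), a minimal line-by-line sketch along these lines could look as follows; the stopword set and the output filename cleaned.txt are assumptions, not part of the original question:

stopwords = {'the', 'a', 'an', 'and', 'or'}  # assumed example set; use your real stopword list

with open('tweets.txt', 'r') as src, open('cleaned.txt', 'w') as dst:
    for line in src:
        words = line.lower().split()
        kept = [w for w in words if w not in stopwords]
        dst.write(' '.join(kept) + '\n')

Nothing is accumulated in memory here, so memory use stays flat regardless of file size.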

Amber
  • With this I run the process for every line. Could this take more time? – Joe Kalvos Jan 04 '13 at 11:08
  • Depends on the process. In the average case, any extra time from additional function calls would be more than made up for by the time you save by using file buffering. – Amber Jan 04 '13 at 18:32
1

Try to work on the file one line at a time:

lowered = []

with open('tweets.txt', 'r') as handle:
    for line in handle:
        # keep accumulating the results ...
        lowered.append(line.lower())
        # or just dump it to stdout right away
        print(line.lower())

for line in lowered:
    # print or write to file, or whatever you require
    pass

That way you reduce the memory overhead, which for large files might otherwise lead to swapping and kill performance.

Here are some benchmarks on a file with about 1M lines:

# (1) real 0.223    user 0.195  sys 0.026   pcpu 98.71
with open('medium.txt') as handle:
  for line in handle:
      pass

# (2) real 0.295    user 0.262  sys 0.025   pcpu 97.21
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        pass
    print(i) # 1031124

# (3) real 21.561 user 5.072  sys 3.530   pcpu 39.89
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        print(line.lower())

# (4) real 1.702  user 1.605  sys 0.089   pcpu 99.50
lowered = []
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        lowered.append(line.lower())

# (5) real 2.307  user 1.983  sys 0.159   pcpu 92.89
lowered = []
with open('medium.txt', 'r') as handle:
    for i, line in enumerate(handle):
        lowered.append(line.lower())

with open('lowered.txt', 'w') as handle:
    for line in lowered:
        handle.write(line)

You can also iterate over two files at once:

# (6) real 1.944  user 1.666  sys 0.115   pcpu 91.59
with open('medium.txt', 'r') as src, open('lowered.txt', 'w') as sink:
    for i, line in enumerate(src):
        sink.write(line.lower())

Results as table:

# (1) noop                   0.223
# (2) w/ enumerate           0.295
# (4) list buffer            1.702
# (6) on-the-fly             1.944
# (5) r -> list buffer -> w  2.307
# (3) stdout print          21.561
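
The timings above look like output from a shell time command; as a rough way to reproduce a comparable measurement from within Python, one could use time.perf_counter (an assumption about methodology, not how the answer was benchmarked):

import time

start = time.perf_counter()
with open('medium.txt', 'r') as src, open('lowered.txt', 'w') as sink:
    for line in src:
        sink.write(line.lower())
print('elapsed seconds:', time.perf_counter() - start)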
miku
  • Better would be to write out or print the lines as they are processed, so you don't have to buffer the entire list of processed lines in memory. – Amber Jan 04 '13 at 10:01
  • @Amber, yes, I added a note. – miku Jan 04 '13 at 10:03
0

Change your script as follows:

with open('tweets.txt', 'r') as fileinput:
    for line in fileinput:
        # do what you need to do with each line
        line = line.lower()

So, basically, don't read the whole file into memory using read(); just iterate over the lines of the open file. When you read a huge file into memory, your process may grow to the point where the system needs to swap parts of it out, and that will make it very slow.

piokuc
  • There's no reason to use `.readlines()` - you can just iterate over the file object itself. – Amber Jan 04 '13 at 09:59