
I have a TSV file with 340,000 lines of data. When I read the file with Python 3.5 the timing is fine, but when I run the same code with Python 2.7 reading is very slow. I don't have any clue what is going on here; here is the code:

import codecs as cds

INPUT_DATA_DIR = 'some_tsv_file.tsv'
ENT = "entities"

def train_data_getter(input_dir=INPUT_DATA_DIR):
    file_h = cds.open(input_dir, encoding='utf-8')
    data = file_h.read()
    file_h.close()
    sentences = data.split("\n\n")
    parsed_data = parser(sentences[0])
    return parsed_data

def parser(raw_data):
    words = [line for line in raw_data.split("\n")]
    temp_l = []
    temp_s = ""
    for word in words:
        token, ent = word.split('\t')
        temp_s += token
        temp_s += " "
        temp_l.append(ent)
    data = [(temp_s), {ENT: temp_l}]
    return data

Edit

Thanks to @PM 2Ring, the problem was the string concatenation inside the for loop, but the reason for the huge difference between Python 2.7 and 3.5 is still not clear to me.
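
For reference, this is roughly the join-based `parser` I ended up with after that suggestion (a sketch, not necessarily my exact final code):

def parser(raw_data):
    tokens = []
    entities = []
    for word in raw_data.split("\n"):
        token, ent = word.split('\t')
        tokens.append(token)
        entities.append(ent)
    # one join at the end instead of += on a string inside the loop
    # (the original also left a trailing space, hence the + " ")
    return [" ".join(tokens) + " ", {ENT: entities}]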

  • So what is your question? – Niklas Mertsch Apr 29 '18 at 07:55
  • I think his question is why does it run so much faster in 3.5. – Jack Ryan Apr 29 '18 at 07:55
  • As it should be clear: WHY is there so much of a timing difference? – ᴀʀᴍᴀɴ Apr 29 '18 at 07:56
  • Where is it slow? You might ask yourself what could potentially be the expensive operations here, and do some timing to figure out exactly which function calls are slowing down your program in 2.7 (a timing sketch along these lines follows the comments). – Jack Ryan Apr 29 '18 at 07:58
  • What does "good" mean, and what does "very slow" mean? Please show some actual metrics instead of a qualitative scale. – user3483203 Apr 29 '18 at 07:58
  • @JackRyan The time-consuming part is the `for` loop in the `parser` function. – ᴀʀᴍᴀɴ Apr 29 '18 at 08:02
  • This may have little bearing on the speed difference, but doing string concatenation in a loop isn't a great idea. So change that `temp_s` stuff to accumulate the strings in a list (like you do with `temp_l`), and then at the end of the loop `.join` the list into a string. – PM 2Ring Apr 29 '18 at 08:02
  • @chrisz, "good" and "very slow" are relative terms; the metric, as I mentioned, is time: one number is very short and the other is very long! – ᴀʀᴍᴀɴ Apr 29 '18 at 08:04
  • @PM2Ring, that is the correct idea, but why does such a huge difference happen here? – ᴀʀᴍᴀɴ Apr 29 '18 at 08:05
  • @PM2Ring: I think that could have a _lot_ of bearing on the speed difference! (Various Python versions have hacks that prevent particular common forms of repeated concatenation having quadratic running time; I don't remember which version(s) the hack(s) were introduced in, though.) – Mark Dickinson Apr 29 '18 at 08:06
  • @Arman I'm not sure, but Python 3 has made numerous improvements in various places, including some optimizations of common string operations. However, Python 2.7 already _has_ some optimizations with concatenation of short strings. Try running that code on an early version of Python 2 and you'll really see a big slowdown. :) – PM 2Ring Apr 29 '18 at 08:08
  • @MarkDickinson I'm not sure either, but I know some optimization in that direction was done prior to Python 3; I remember Alex Martelli was _not_ pleased: see https://stackoverflow.com/a/1350289/4014959 – PM 2Ring Apr 29 '18 at 08:11
  • @PM2Ring Ha; that's a fantastic answer. Thanks for the link. – Mark Dickinson Apr 29 '18 at 08:14
  • **pandas supports fast reads of csv/tsv**, see my answer. I suggest `chunksize=100000`. There are tons of existing answers on that. Just don't do that iterative string append 340,000 times. It's kind of a red herring to argue about which version accelerates inefficient code like that. – smci Apr 29 '18 at 08:25
  • [How to read a 6 GB csv file with pandas](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas), [Pandas: Reading TSV into DataFrame](https://stackoverflow.com/questions/44503190/pandas-reading-tsv-into-dataframe), [How to I load a tsv file into a Pandas DataFrame?](https://stackoverflow.com/questions/9652832/how-to-i-load-a-tsv-file-into-a-pandas-dataframe) ... – smci Apr 29 '18 at 08:32
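
Following up on the timing suggestion above, here is a minimal sketch of the kind of measurement you could run under each interpreter to isolate the `+=` versus `.join` cost (the token list is made up to roughly mimic the file size; it is not the real data):

from __future__ import print_function
import timeit

tokens = ["token"] * 340000  # fake input, roughly the size of the file

def concat():
    # repeated string concatenation, as in the original parser
    s = ""
    for t in tokens:
        s += t + " "
    return s

def join():
    # single join at the end, as suggested in the comments
    return " ".join(tokens) + " "

# Run this under each interpreter (e.g. python2.7 and python3.5) and compare.
print("+=  :", timeit.timeit(concat, number=3))
print("join:", timeit.timeit(join, number=3))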

1 Answer

Iteratively appending to a string 340,000 times inside a loop is woefully inefficient, so just don't do it. In any case, pandas supports reading TSV files, it will be more performant, and it takes a chunksize argument for fast reading of large csv/tsv files:

import pandas as pd
train = pd.read_table('some_tsv_file.tsv', delim_whitespace=True, chunksize=100000)
# you probably need encoding='utf-8'. You may also need to tweak the settings for header, skiprows etc. Read the doc.
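
With `chunksize` set, `read_table` returns an iterator of DataFrames rather than a single frame. A minimal sketch of consuming it (the `header=None` setting and the column names `token` and `entity` are assumptions, since the file layout isn't documented):

import pandas as pd

# chunksize makes read_table yield DataFrames of up to 100,000 rows each
reader = pd.read_table('some_tsv_file.tsv', sep='\t', header=None,
                       names=['token', 'entity'], encoding='utf-8',
                       chunksize=100000)
for chunk in reader:
    # process each chunk here; len(chunk) is just a placeholder
    print(len(chunk))
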
  • Thanks for your answer. By not doing the string concatenation in the for loop the timing issue is resolved; my main remaining question is the difference between the Python versions. – ᴀʀᴍᴀɴ Apr 29 '18 at 08:52
  • Arman, just install pandas and run pd.read_table/csv already. The speed difference should amaze you. You should never rely on the interpreter to optimize really inefficient code. – smci Apr 29 '18 at 09:34