
I have written a script which splits all the strings in a sentence into parts;

for instance:

"geldigim" -> "gel" "di" "g" "i" "m"

While some strings may be split as above, others may be split as follows:

"bildi" > "bil" "di"

or some words may not be split at all:

"kos" -> "kos"

Whether and how a string is split is decided entirely by a segmentation function.

What I want to do is the following:

geldigim -> /gel* *di* *g* *i* *m/
bildi -> /bil* *di/
kos -> /kos/

What I did is the following:

I have a corpus of 37,251,512 sentences, and I have written this script:

import codecs
import morfessor

if __name__ == "__main__":
    io = morfessor.MorfessorIO()
    print "Importing corpus ..."
    # read the whole corpus into memory as a list of unicode lines
    f = codecs.open("corpus/corpus_tr_en/corpus.tr", encoding="utf-8").readlines()
    print "Importing morphology model ..."
    model = io.read_binary_model_file('seg/tr/model.bin')
    corpus = open('dataset/dataset_tr_en/full_segmented.tr', 'w')
    for a in range(len(f)):
        print str(a) + ' : ' + str(len(f))
        words = f[a].replace('\n', '').split()
        line_str = ''
        for word in words:
            # split the word into morphs, e.g. "geldigim" -> ["gel", "di", "g", "i", "m"]
            segmentation = model.viterbi_segment(word)[0]
            if len(segmentation) == 1:
                line_str = '/' + segmentation[0] + '/'
            if len(segmentation) == 2:
                line_str = '/' + segmentation[0] + '* *' + segmentation[1] + '/'
            if len(segmentation) > 2:
                line_str = ''
                for b in range(len(segmentation)):
                    if b == 0:
                        line_str = line_str + '/' + segmentation[b] + '*'
                    if (b != 0) and (b != (len(segmentation) - 1)):
                        line_str = line_str + ' *' + segmentation[b] + '* '
                    if b == (len(segmentation) - 1):
                        line_str = line_str + ' *' + segmentation[b] + '/'
            line_str = line_str + ' '
            corpus.write(line_str.encode('utf-8'))
        corpus.write('\n')

    corpus.close()

This script loops over each sentence, and over each word in the sentence, and splits the word into parts with model.viterbi_segment (the model itself is loaded once with io.read_binary_model_file).

But it is too expensive for me; it is very slow.

Could you suggest a way to make the process much faster?

Thanks,

yusuf
    What does viterbi_segment() do? Could you post the code for this function, please? /Teşekkür* *ler/ – Ukimiku Nov 06 '16 at 20:25
  • It is basically a function that fits the string to a machine learning model created by "Morfessor". – yusuf Nov 06 '16 at 20:30
  • I was asking because, if you showed us the code, maybe there would be a way to speed it up. – Ukimiku Nov 06 '16 at 20:31

3 Answers

2

What probably slows this down a lot is the composition of line_str using multiple string concatenations, which are not recommended when you want performance (it is okay for one-off things like filename = base + ".txt", but not for intensive processing).

Create line as a list instead and use str.join to build the final string only when you write it to disk. Appending to a list is much faster.
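The general pattern looks like this (a toy sketch, independent of the actual code):

# slow: every += builds a new string, copying everything accumulated so far
s = ''
for i in range(100000):
    s += 'x'

# fast: collect the pieces in a list and join them once at the end
parts = []
for i in range(100000):
    parts.append('x')
s = ''.join(parts)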

And as Maximilian just suggested, you can turn your conditions into elif, since they are mutually exclusive (in both if blocks). I also added some more micro-optimizations that improve readability as well.

Here is my proposal for how your inner loop could look:

for word in words:
    segmentation = model.viterbi_segment(word)[0]
    lenseg = len(segmentation)
    if lenseg == 1:
        line = ['/', segmentation[0], '/']
    elif lenseg == 2:
        line = ['/', segmentation[0], '* *', segmentation[1], '/']
    elif lenseg > 2:
        line = []
        for b in range(lenseg):
            if b == 0:
                line += ['/', segmentation[0], '*']
            elif b != (lenseg - 1):
                line += [' *', segmentation[b], '* ']
            else:
                line += [' *', segmentation[b], '/']
    line.append(" ")
    corpus.write("".join(line).encode('utf-8'))

Alternatives:

  • write each string to the output file every time, or
  • write the data to an io.StringIO object and retrieve its contents once at the end to write them to the output file, as sketched below.
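A minimal sketch of the io.StringIO variant (Python 2, where io.StringIO accepts only unicode; f, model and corpus are the objects from the question, and line is built the same way as in the loop above; note that the question's script already binds the name io to MorfessorIO, so import StringIO directly):

from io import StringIO  # avoid clashing with the `io` variable in the question

buf = StringIO()
for sentence in f:
    for word in sentence.rstrip('\n').split():
        segmentation = model.viterbi_segment(word)[0]
        lenseg = len(segmentation)
        if lenseg == 1:
            line = ['/', segmentation[0], '/']
        elif lenseg == 2:
            line = ['/', segmentation[0], '* *', segmentation[1], '/']
        else:
            line = ['/', segmentation[0], '*']
            for b in range(1, lenseg - 1):
                line += [' *', segmentation[b], '* ']
            line += [' *', segmentation[-1], '/']
        line.append(' ')
        buf.write(u''.join(line))  # cheap in-memory write
    buf.write(u'\n')

# a single disk write instead of one write per word
corpus.write(buf.getvalue().encode('utf-8'))

For 37 million sentences you may want to flush the buffer every N lines rather than only once at the end, to keep memory usage bounded.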
Jean-François Fabre
  • using elif instead of redundant ifs would probably speed up the whole thing a bit more. – Maximilian Peters Nov 06 '16 at 20:29
  • right! I was so focused on the strings that I did not see that. Makes sense, although it's probably a micro-optimization compared to the string issue. And maybe the problem is in the viterbi function as well, but we don't have it. At any rate, if the number of words is big, the list trick should speed up the program _a lot_ (I have already had this problem myself with big text files). – Jean-François Fabre Nov 06 '16 at 20:32
  • if we are already talking about micro-optimizations: len(segmentation) is calculated 3 times, so put it in a variable (same for len(f)); the 2nd if block can use some elifs as well; segmentation[b] in the first if can be written as segmentation[0]. – Maximilian Peters Nov 06 '16 at 20:44
  • edited. And writing it as elif allows simplifying the conditions a great deal. This should be on Code Review... – Jean-François Fabre Nov 06 '16 at 20:51
  • Almost no difference. – yusuf Nov 06 '16 at 22:36
2
  • Jean-François Fabre covered the string optimization really well.
  • The other elephant in the room is the use of readlines() for 37,251,512 sentences: just use for line in f, which iterates the file lazily instead of loading it all into memory first.
  • Depending on how many duplicate words there are in your data and on the cost of model.viterbi_segment, it might be beneficial to keep a cache of already-segmented words instead of doing it all over again for repeated words.
  • It seems that you are using Python 2.x; in that case, use xrange instead of range.
  • .replace('\n', '').split() is slow, since it has to scan the whole line when you just want to remove the final line break (there can't be more than one in your case). You could use rstrip('\n') instead.
  • There is some redundancy in your code: e.g. each formatted word needs to end with /, but you have that in 3 places.
  • All those changes might be tiny, but they add up, and your code becomes easier to read as well. A sketch combining these points follows below.
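A minimal sketch combining these points (Python 2; it assumes the codecs import and the model object from the question, and the cache dict is just one way to implement the repeated-words idea):

cache = {}  # word -> formatted segmentation, so repeated words are segmented only once

corpus = open('dataset/dataset_tr_en/full_segmented.tr', 'w')
with codecs.open("corpus/corpus_tr_en/corpus.tr", encoding="utf-8") as f:
    for line in f:  # lazy iteration: no readlines(), no range()/xrange()
        out = []
        for word in line.rstrip('\n').split():  # rstrip instead of replace
            if word not in cache:
                segmentation = model.viterbi_segment(word)[0]
                # the '/' delimiters appear in exactly one place
                cache[word] = u'/' + u'* *'.join(segmentation) + u'/'
            out.append(cache[word])
        corpus.write((u' '.join(out) + u' \n').encode('utf-8'))
corpus.close()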
Maximilian Peters
1

How about an inner loop like this:

segmentation = model.viterbi_segment(word)[0]
line = '* *'.join(segmentation)
corpus.write(("/%s/ " % line).encode('utf-8'))

And then, since you already keep the whole input in memory, I would also try to keep the output in memory and write it out in one go, maybe like this:

lines = []
for a in range(len(f)):
    print str(a) + ' : ' + str(len(f))
    words = f[a].replace('\n', '').split()
    sentence = []
    for word in words:
        segmentation = model.viterbi_segment(word)[0]
        sentence.append("/%s/" % '* *'.join(segmentation))
    lines.append(' '.join(sentence))
corpus.write("\n".join(lines).encode('utf-8'))
Vidar