Extracting data from multiple text files using Python

Question

I am trying to extract data from several text files simultaneously.

import fileinput

num_lines = sum(1 for line in open('2grams.txt'))  ## in order not to print junk

count = 0
f0 = open("2gram_glues.txt", 'r')
f1 = open("2grams.txt", 'r')
f2 = open("output.txt", 'w')
f3 = open('2mwus.txt', 'r')

with fileinput.input(files=('2grams.txt', '2gram_glues.txt', '2mwus.txt')) as f:
    for line in f:
        f3.seek(0, 0)

        for line1 in f3:

            if line == line1:
                f2.write("The 2 Gram is: " + line.strip() + "\t The score is: " + f0.readline())
                count += 1
                if count >= num_lines:
                    break


f0.close()
f1.close()
f2.close()
f3.close()

Both 2grams.txt and 2gram_glues.txt have the same number of lines, and each line of one corresponds to the same line of the other. However, the data I actually want to write to the output file is the data from 2mwus.txt that also appears in 2grams.txt, and 2mwus.txt has a different number of lines.

The problem is that I want to print each entry of 2mwus.txt together with its score from 2gram_glues.txt.

The scores I get from 2gram_glues.txt come out in file order rather than matched to the entries of 2mwus.txt.

What am I doing wrong when writing the data?

The link to the text files:

https://drive.google.com/folderview?id=0B1oTQq97VF44V1p3MEZwQkhqTjQ&usp=sharing

I am unable to get the sense of what your objective is and what you are doing with this program. You have opened files individually and also using fileinput.input() which adds to the confusion. Edit your question and also provide your files, the output you are getting and expected output — Sharad, Mar 08 '16 at 09:17
I opened the files individually in order to have them as objects and use them later in the code. I also added a link to the files below. — Daniel, Mar 08 '16 at 09:24

score 1 · Accepted Answer · edited May 23 '17 at 10:28

I think that you don't need to use fileinput:

num_lines = sum(1 for line in open('2grams.txt'))  # upper bound on matches, in order not to print junk

count = 0
intersect = open('2grams.txt', 'r')
out_file = open("output.txt", 'w')  # opened for the results; this version prints to the console instead
scores = open("2gram_glues.txt", 'r')

# Read the scores once: line i of 2gram_glues.txt is the score for line i of 2grams.txt.
scores_lines = scores.readlines()

with open('2mwus.txt', 'r') as base:
    for line in base:

        # Separate the trailing number from the 2-gram text.
        line = line.rstrip()
        number = line[-2:]
        number = int(number.lstrip())

        line = line[:-2]
        line = line.rstrip()

        # Rewind 2grams.txt and search it for the same 2-gram.
        intersect.seek(0, 0)

        for i, line_intersect in enumerate(intersect):
            line_intersect = line_intersect.rstrip()
            if line == line_intersect:
                # The index of the match in 2grams.txt selects the matching score.
                print("**The 2 Gram is: " + line.strip() + "\t The score is: " + scores_lines[i] +
                      'The number is ' + str(number))
                count += 1
                if count >= num_lines:
                    break

intersect.close()
out_file.close()
scores.close()
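
As an alternative to rescanning 2grams.txt for every line of 2mwus.txt, the 2-grams and their scores could be loaded into a dictionary once and then looked up directly. The following is only a sketch under stated assumptions, not the code that produced the output in this answer: the helper names `build_score_table` and `match_scores` are hypothetical, line i of 2gram_glues.txt is assumed to score line i of 2grams.txt, and each 2mwus.txt line is assumed to end with a whitespace-separated number. The sample lists stand in for the files and reuse two entries from the answer's output.

```python
def build_score_table(gram_lines, score_lines):
    """Map each 2-gram to the score found on the same line number."""
    return {g.rstrip(): s.strip() for g, s in zip(gram_lines, score_lines)}


def match_scores(mwu_lines, table):
    """Yield (2-gram, number, score) for every 2mwus line present in the table."""
    for raw in mwu_lines:
        parts = raw.rsplit(None, 1)  # split off the trailing number
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        gram, number = parts[0].rstrip(), parts[1]
        if gram in table:
            yield gram, int(number), table[gram]


# In-memory stand-ins for 2grams.txt, 2gram_glues.txt and 2mwus.txt.
grams = ["phone but\n", "(850, 900,\n"]
glues = ["0.1\n", "0.857143\n"]
mwus = ["(850, 900,\t12 \n", "phone but\t2 \n"]

table = build_score_table(grams, glues)
results = list(match_scores(mwus, table))
for gram, number, score in results:
    print("The 2 Gram is: " + gram + "\t The score is: " + score +
          "\t The number is " + str(number))
```

With the real files, the open file objects can be passed in place of the lists, since both helpers only iterate over lines. If the same 2-gram appears more than once in 2grams.txt, the dictionary keeps the score of its last occurrence.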

Slicing and stripping

From:

'(850,·900,\t12·'
'(frequencies·850,\t4·'
'phone·but\t2·'

# \t denotes a tab, · denotes a space

Using:

line = line.rstrip()

Makes:

'(850,·900,\t12'
'(frequencies·850,\t4'
'phone·but\t2'

Then get the number:

number = line[-2:]

Gives:

'12'
'\t4'
'\t2'

Then left-stripping the number:

number = int(number.lstrip())

Gives:

12
4
2

Continuing with our "line":

'(850,·900,\t12'
'(frequencies·850,\t4'
'phone·but\t2'

Using

line = line[:-2]
line = line.rstrip()

Gives:

'(850, 900,'
'(frequencies 850,'
'phone but'

A bit hardcoded, but it avoids the need for RegEx.
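
For comparison, here is roughly what a RegEx version of the same split might look like. This is a sketch, not part of the code above: the pattern and the helper name `split_entry` are my own assumptions, and the last sample line (with a three-digit number) is made up to show that the pattern does not depend on how many digits the number has.

```python
import re

# Hypothetical pattern: lazily capture the 2-gram, then whitespace,
# then the trailing digits, allowing trailing whitespace after them.
ENTRY = re.compile(r'^(.*?)\s+(\d+)\s*$')


def split_entry(line):
    """Return (2-gram, number), or None if the line has no trailing number."""
    match = ENTRY.match(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


samples = ['(850, 900,\t12 ', '(frequencies 850,\t4 ', 'phone but\t2 ', 'iPhone 5\t105 ']
parsed = [split_entry(s) for s in samples]
print(parsed)
```

Because the number is matched as a run of digits rather than as a fixed-width slice, no slice index has to be adjusted when the numbers get longer.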

Output

**The 2 Gram is: (850, 900,  The score is: 0.857143
The number is 12
**The 2 Gram is: (Bands 4    The score is: 0.4
The number is 2
**The 2 Gram is: (frequencies 850,   The score is: 1
The number is 4
**The 2 Gram is: 1, 3,   The score is: 1
The number is 8
**The 2 Gram is: 13, 25)     The score is: 0.666667
The number is 2
**The 2 Gram is: 1800, 1900  The score is: 1
The number is 8
**The 2 Gram is: 1900, 2100  The score is: 1
The number is 10
**The 2 Gram is: 5 compatible    The score is: 0.444444
The number is 2
**The 2 Gram is: A1428: UMTS/HSPA+/DC-HSDPA  The score is: 0.5
The number is 2
**The 2 Gram is: A1429: UMTS/HSPA+/DC-HSDPA  The score is: 0.4
The number is 2
**The 2 Gram is: Australia, Germany,     The score is: 1
The number is 2
**The 2 Gram is: B (800,     The score is: 1
The number is 2
**The 2 Gram is: Full specs  The score is: 1
The number is 2
**The 2 Gram is: GSM model   The score is: 0.428571
The number is 6
**The 2 Gram is: In deciding     The score is: 1
The number is 2
**The 2 Gram is: KDDI network    The score is: 0.5
The number is 2
**The 2 Gram is: South Korea).   The score is: 1
The number is 2
**The 2 Gram is: UMTS/HSPA+/DC-HSDPA (850,   The score is: 0.666667
The number is 6
**The 2 Gram is: US AT&T     The score is: 1
The number is 2
**The 2 Gram is: US, along   The score is: 1
The number is 2
**The 2 Gram is: bands 4     The score is: 0.4
The number is 2
**The 2 Gram is: bands, making   The score is: 1
The number is 2
**The 2 Gram is: battery life    The score is: 0.363636
The number is 2
**The 2 Gram is: blazing fast    The score is: 1
The number is 2
**The 2 Gram is: didn't come     The score is: 0.666667
The number is 3
**The 2 Gram is: fact that   The score is: 0.4
The number is 3
**The 2 Gram is: iPhone 5    The score is: 0.526316
The number is 5
**The 2 Gram is: meet compatibility  The score is: 1
The number is 2
**The 2 Gram is: model A1429:    The score is: 0.5
The number is 4
**The 2 Gram is: networks in     The score is: 0.258065
The number is 4
**The 2 Gram is: networks. However,  The score is: 1
The number is 2
**The 2 Gram is: one GSM.    The score is: 0.363636
The number is 2
**The 2 Gram is: phone but   The score is: 0.1
The number is 2
**The 2 Gram is: phone. This     The score is: 0.444444
The number is 2
**The 2 Gram is: release three   The score is: 0.8
The number is 2
**The 2 Gram is: sim card    The score is: 0.8
The number is 2
**The 2 Gram is: standards worldwide.    The score is: 1
The number is 2
**The 2 Gram is: support LTE     The score is: 0.296296
The number is 4
**The 2 Gram is: the phone   The score is: 0.188679
The number is 10
**The 2 Gram is: to my   The score is: 0.12
The number is 3
**The 2 Gram is: works great     The score is: 0.4
The number is 2

Ideas to take home:

Be aware of whitespace; rstrip is your ally.
Using f1, f2 and f3 is intuitive, but in the long run you get confused. Use meaningful names!
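
A small illustration of the whitespace point, using made-up strings rather than lines from the actual files: two lines that look identical can still compare unequal until the trailing whitespace is stripped.

```python
line_from_grams = "phone but\n"   # as read from one file
line_from_mwus = "phone but \n"   # same 2-gram, with a trailing space

raw_equal = (line_from_grams == line_from_mwus)
stripped_equal = (line_from_grams.rstrip() == line_from_mwus.rstrip())

print(raw_equal, stripped_equal)  # False True
```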

Hope it helps!

This is exactly the problem I described above. The scores in your output are wrong and do not correspond to the scores in 2gram_glues.txt. This is exactly my output. — Daniel, Mar 08 '16 at 10:10
How do the scores in 2gram_glues.txt relate to the data in 2grams.txt? — Mike, Mar 08 '16 at 10:16
The score on the 1st line of 2gram_glues.txt corresponds to the data on the 1st line of 2grams.txt, and so on. The data in 2mwus.txt doesn't line up with these lines, which is why I'm checking "if line == line1:" in my code. The data on the 1st line of 2mwus.txt can correspond to the data and score on line 9 of 2grams.txt and 2gram_glues.txt respectively, and so on ... — Daniel, Mar 08 '16 at 10:19
And now? I just swapped 2grams.txt with 2mwus.txt; in other words, I now iterate over 2mwus first instead of 2grams. — Mike, Mar 08 '16 at 10:27
So I ran your code and it has mistakes: there are missing entries from the mwus.txt, for example the first line, and also line 7, etc. The scores seem fine; I'm still checking. — Daniel, Mar 08 '16 at 10:41
You are right. I just changed line = line[:-2] to line = line[:-3]. The idea is to cut the trailing number first and then the trailing whitespace. For ****2* it cuts ****[2*], but the problem was with ****12*, where it cut ****1[2*]; now, with line[:-3], it cuts ****[12*]. Regular expressions would be more thorough, but I think they are too complicated for such a simple problem. If you do string analysis regularly, at some point you will learn RegEx :) — Mike, Mar 08 '16 at 10:52
My friend, you found a solution, thank you. One last thing: I actually need the number that comes after the 2-gram in 2mwus.txt, the one you removed. What should I write in order not to remove it? — Daniel, Mar 08 '16 at 11:01
It is stored in the "number" variable; use it as you wish. Next time take a look at RegEx: it has a steep learning curve, but once learned it solves all these problems easily. — Mike, Mar 08 '16 at 11:22
