I have a .txt file of around 3 GB that contains pre-trained Spanish word vectors. The format of the file is like this:

numWords numParameters

word1 num1.1 num1.2 num1.3 .... num1.300

word2 num2.1 num2.2 num2.3 .... num2.300

. . .

word1000653 num1000653.1 num1000653.2 num1000653.3 .... num1000653.300

where each num is a number representing one component of the vector, and each row represents a different word.

The file doesn't contain only Spanish words, so I need to clean the data a bit, but the file is too big to be opened by normal means. My idea was to divide the file into smaller text files based on the number of letters in each word: all the words with 1 letter would go in a file named database1, words with 2 letters in database2, and so on. I wrote the Python code below for that purpose.

Using a smaller version of the text file that has only 5 words, the code seems to work fine, but when I try it on the real file it does nothing, and again I think the problem is that Python can't load the whole file.

# open the embeddings file; the original snippet omitted this line,
# "database.txt" is assumed here to match the answer below
database = open("database.txt")

rows = database.read()  # reads the entire 3 GB file into memory at once
database.close()
rows = rows.splitlines()
numWords = len(rows)

# split every line into [word, component1, component2, ...]
for i in range(numWords):
    rows[i] = rows[i].split(" ")

rows.pop(0)  # drop the "numWords numParameters" header line
numWords = len(rows)
print(rows)

# collect the rows whose word has exactly 1 letter
list1 = []
for i in range(numWords):
    x = len(rows[i][0])
    if x == 1:
        list1.append(rows[i])

list1.sort(key=lambda x: x[0])  # sort alphabetically by word
NewNumWords = len(list1)

# join each row back into a single space-separated line
for i in range(NewNumWords):
    list1[i] = " ".join(list1[i])

print(list1)
list1.insert(0, str(NewNumWords) + " 300")  # re-add a header line
list1 = "\n".join(list1)

print(list1)
file = open("vectores palabras español/palabras1.txt", "w")
file.write(list1)
file.close()
  • "but the file is too big to be opened by normal means" ... and yet your program opens the file and reads the entire thing into memory. See [how to read a large file line by line](https://stackoverflow.com/questions/8009882/how-to-read-a-large-file-line-by-line). – chash Jul 07 '20 at 17:03

1 Answer

If this is useful to you: the code below divides the file into separate files by word length (the length of the first word of each line), without reading the whole file into memory.

It uses a dictionary to store open file handles for the different word lengths. When a word with a length not already encountered is found, the output file for that length is opened for writing and the handle is cached in the dictionary.

It might not be 100% equivalent to what your code was doing (in particular, sorting is not easily possible in a single streaming pass), but it will be similar.

files = {}  # maps word length -> open output file handle

with open("database.txt") as f:
    next(f)  # skip the "numWords numParameters" header line
    for line in f:
        length = len(line.split()[0])

        if length not in files:
            # first time this word length is seen: open its output file
            files[length] = open(f"palabras{length}.txt", "w")

        files[length].write(line)

# close all the output files
for fh in files.values():
    fh.close()
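One way to recover the sorting from your original approach: since each split file is far smaller than the original, it can be sorted in memory as a separate pass afterward. A minimal sketch, assuming the palabras{length}.txt files produced above:

import glob

# sort each split file alphabetically by its first word; individually,
# the split files are small enough to load into memory
for path in glob.glob("palabras*.txt"):
    with open(path) as f:
        lines = f.readlines()
    lines = [l if l.endswith("\n") else l + "\n" for l in lines]  # normalize line endings
    lines.sort(key=lambda line: line.split()[0])
    with open(path, "w") as f:
        f.writelines(lines)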
alani
  • this method kinda works: it makes some files with words in them, just like I wanted, but, as an example, palabras1 has only one word in it, palabras2 has 4 words, and palabras6 has only one word, so in the end it somehow doesn't go through the whole txt file. I wonder if the problem is again the size. The file in question can be found here: https://www.kaggle.com/rtatman/pretrained-word-vectors-for-spanish – Miguel Fernando Macias Macias Jul 08 '20 at 00:19
  • @MiguelFernandoMaciasMacias I've added a skip for the first line, but apart from that, the code should write every line to one output file or another. The size should not be a problem because it does not store them all in memory. Did the program stop with any kind of error message? – alani Jul 08 '20 at 00:26
  • how would you merge the files into one after splitting them? – Miguel Fernando Macias Macias Jul 19 '20 at 03:47
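To sketch an answer to that last comment (assuming the palabras{length}.txt files above, and that 300 is the vector dimensionality, as in the original file): stream each split file into one output file, rebuilding the header from a first counting pass so the whole merge never holds more than one line in memory.

import glob

paths = sorted(glob.glob("palabras*.txt"))

# first pass: count the lines across all split files to rebuild the header
total = 0
for path in paths:
    with open(path) as f:
        for _ in f:
            total += 1

# second pass: stream every split file into the merged file line by line
with open("merged.txt", "w") as out:
    out.write(f"{total} 300\n")
    for path in paths:
        with open(path) as f:
            for line in f:
                out.write(line)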