I have a .txt file of around 3 GB that contains pre-trained Spanish word vectors. The format of the text is like this:
numWords numParameters
word1 num1.1 num1.2 num1.3 .... num1.300
word2 num2.1 num2.2 num2.3 .... num2.300
. . .
word1000653 num1000653.1 num1000653.2 num1000653.3 .... num1000653.300
where each num is a number that represents one component of the vector, and each row represents a different word.
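For example, a single row can be split into the word and its vector like this (just a sketch; the sample line below is made up, and a real row would have 300 numbers):

line = "hola 0.12 -0.05 0.33"  # made-up sample row
parts = line.split(" ")
word = parts[0]                          # the Spanish word
vector = [float(x) for x in parts[1:]]   # the vector components
print(word, vector)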
The file doesn't contain only Spanish words, so I need to clean the data a bit, but the file is too big to be opened by normal means. I was thinking of dividing it into smaller text files based on the number of letters in each word: all the words with 1 letter should go in a file named database1, words with 2 letters in database2, and so on. I wrote the code below in Python for that purpose.
Using a smaller version of the text file that has only 5 words, the code seems to work fine, but when I try it on the real file it does nothing. Again, I think the problem is that Python can't load the whole file into memory.
database = open("embeddings.txt", "r", encoding="utf-8")  # placeholder path for the 3 GB file
rows = database.read()
database.close()

rows = rows.splitlines()
numWords = len(rows)
for i in range(numWords):
    rows[i] = rows[i].split(" ")
rows.pop(0)  # drop the "numWords numParameters" header line
numWords = len(rows)
print(rows)

# keep only the rows whose word has exactly 1 letter
list1 = []
for i in range(numWords):
    x = len(rows[i][0])
    if x == 1:
        list1.append(rows[i])
list1.sort(key=lambda x: x[0])

# turn each row back into a single space-separated line
NewNumWords = len(list1)
for i in range(NewNumWords):
    list1[i] = " ".join(list1[i])
print(list1)

list1.insert(0, str(NewNumWords) + " 300")  # header for the new, smaller file
list1 = "\n".join(list1)
print(list1)

file = open("vectores palabras español/palabras1.txt", "w", encoding="utf-8")
file.write(list1)
file.close()
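I was also wondering if going through the file line by line, instead of reading it all at once, could avoid the memory problem. This is only a sketch of what I mean (the input path is a placeholder, and the "count 300" headers would still have to be prepended in a second pass):

from collections import defaultdict

counts = defaultdict(int)  # how many words end up in each length bucket
outputs = {}               # one open output file per word length

with open("embeddings.txt", "r", encoding="utf-8") as database:
    next(database)  # skip the "numWords numParameters" header
    for line in database:
        word = line.split(" ", 1)[0]
        n = len(word)
        if n not in outputs:
            outputs[n] = open("palabras%d.txt" % n, "w", encoding="utf-8")
        outputs[n].write(line)
        counts[n] += 1

for f in outputs.values():
    f.close()

print(counts)

Would something like this be the right way to handle a file this big?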