I'm attempting a similar operation as shown here. I begin with reading in two columns from a CSV file that contains 2405 rows in the format of: Year e.g. "1995" AND cleaned e.g. ["this", "is, "exemplar", "document", "contents"], both columns utilise strings as data types.
df = pandas.read_csv("ukgovClean.csv", encoding='utf-8', usecols=[0,2])
I have already pre-cleaned the data, and below shows the format of the top 4 rows:
[IN] df.head()
[OUT] Year cleaned
0 1909 acquaint hous receiv follow letter clerk crown...
1 1909 ask secretari state war whether issu statement...
2 1909 i beg present petit sign upward motor car driv...
3 1909 i desir ask secretari state war second lieuten...
4 1909 ask secretari state war whether would introduc...
[IN] df['cleaned'].head()
[OUT] 0 acquaint hous receiv follow letter clerk crown...
1 ask secretari state war whether issu statement...
2 i beg present petit sign upward motor car driv...
3 i desir ask secretari state war second lieuten...
4 ask secretari state war whether would introduc...
Name: cleaned, dtype: object
Then I initialise the TfidfVectorizer:
[IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
Following this, calling upon the below line results in:
[IN] x = v.fit_transform(df['cleaned'])
[OUT] ValueError: np.nan is an invalid document, expected byte or unicode string.
I overcame this using the solution in the aforementioned thread:
[IN] x = v.fit_transform(df['cleaned'].values.astype('U'))
however, this resulted in a Memory Error (Full Traceback).
I've attempted to look up storage using Pickle to circumvent mass-memory usage, but I'm not sure how to filter it in in this scenario. Any tips would be much appreciated, and thanks for reading.
[UPDATE]
@pittsburgh137 posted a solution to a similar problem involving fitting data here, in which the training data is generated using pandas.get_dummies(). What I've done with this is:
[IN] train_X = pandas.get_dummies(df['cleaned'])
[IN] train_X.shape
[OUT] (2405, 2380)
[IN] x = v.fit_transform(train_X)
[IN] type(x)
[OUT] scipy.sparse.csr.csr_matrix
I thought I should update any readers while I see what I can do with this development. If there are any predicted pitfalls with this method, I'd love to hear them.