
I'm attempting a similar operation as shown here. I begin by reading in two columns from a CSV file that contains 2405 rows, in the format Year, e.g. "1995", and cleaned, e.g. ["this", "is", "exemplar", "document", "contents"]; both columns use strings as their data types.

    import pandas

    df = pandas.read_csv("ukgovClean.csv", encoding='utf-8', usecols=[0,2])

I have already pre-cleaned the data, and below shows the format of the first five rows:

     [IN] df.head()

    [OUT]   Year    cleaned
         0  1909    acquaint hous receiv follow letter clerk crown...
         1  1909    ask secretari state war whether issu statement...
         2  1909    i beg present petit sign upward motor car driv...
         3  1909    i desir ask secretari state war second lieuten...
         4  1909    ask secretari state war whether would introduc...

    [IN] df['cleaned'].head()

   [OUT] 0    acquaint hous receiv follow letter clerk crown...
         1    ask secretari state war whether issu statement...
         2    i beg present petit sign upward motor car driv...
         3    i desir ask secretari state war second lieuten...
         4    ask secretari state war whether would introduc...
         Name: cleaned, dtype: object

Then I initialise the TfidfVectorizer:

    [IN] from sklearn.feature_extraction.text import TfidfVectorizer
    [IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8')

Calling the line below then results in:

    [IN] x = v.fit_transform(df['cleaned'])
   [OUT] ValueError: np.nan is an invalid document, expected byte or unicode string.

I overcame this using the solution in the aforementioned thread:

    [IN] x = v.fit_transform(df['cleaned'].values.astype('U'))

However, this resulted in a MemoryError (Full Traceback).

I've looked into storing the data with pickle to work around the heavy memory usage, but I'm not sure how it would fit into this workflow; the sketch below shows the sort of thing I mean. Any tips would be much appreciated, and thanks for reading.
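
A rough sketch of the persistence I had in mind, assuming the transform succeeds (`tfidf.npz` and `vectorizer.pkl` are placeholder filenames, and `x`/`v` are the matrix and vectorizer from above):

    import pickle
    import scipy.sparse

    # save the sparse TF-IDF matrix in scipy's compressed .npz format
    scipy.sparse.save_npz('tfidf.npz', x)

    # the fitted vectorizer (vocabulary, idf weights) pickles normally
    with open('vectorizer.pkl', 'wb') as f:
        pickle.dump(v, f)

    # later: reload both without refitting
    x = scipy.sparse.load_npz('tfidf.npz')
    with open('vectorizer.pkl', 'rb') as f:
        v = pickle.load(f)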

[UPDATE]

@pittsburgh137 posted a solution to a similar problem involving fitting data here, in which the training data is generated using pandas.get_dummies(). What I've done with this is:

    [IN] train_X = pandas.get_dummies(df['cleaned'])
    [IN] train_X.shape
   [OUT] (2405, 2380)

    [IN] x = v.fit_transform(train_X)
    [IN] type(x)
   [OUT] scipy.sparse.csr.csr_matrix

I thought I should update any readers while I see what I can do with this development. If there are any foreseeable pitfalls with this method, I'd love to hear them.
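
For anyone following along, my understanding is that `get_dummies` treats each distinct document string as a single categorical value, so the 2380 columns above correspond to whole documents rather than individual terms. A toy illustration:

    toy = pandas.Series(['ask secretari', 'beg present', 'ask secretari'])
    pandas.get_dummies(toy)
    # one indicator column per *unique string*, not per word:
    #    ask secretari  beg present
    # 0              1            0
    # 1              0            1
    # 2              1            0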

Dbercules

1 Answer


I believe it's the conversion to dtype('<Unn') that's giving you trouble: a fixed-width unicode array pads every entry to the length of the longest document. Check out the size of the array on a relative basis, using just the first few documents plus a NaN:

>>> df['cleaned'].values
array(['acquaint hous receiv follow letter clerk crown',
       'ask secretari state war whether issu statement',
       'i beg present petit sign upward motor car driv',
       'i desir ask secretari state war second lieuten',
       'ask secretari state war whether would introduc', nan],
      dtype=object)

>>> df['cleaned'].values.astype('U').nbytes
1104

>>> df['cleaned'].values.nbytes
48

It seems like it would make sense to drop the NaN values first (df.dropna(inplace=True)). Then, it should be pretty efficient to call v.fit_transform(df['cleaned'].tolist()).
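
A minimal sketch putting those two steps together (I'm assuming the NaNs live in the text column, hence `subset=['cleaned']`):

>>> df.dropna(subset=['cleaned'], inplace=True)  # drop rows with missing text
>>> x = v.fit_transform(df['cleaned'].tolist())  # a plain list of str avoids the fixed-width '<U' conversion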

Brad Solomon
  • I implemented both of the functions that you shared (thanks), and achieved this result: `<2380x144824 sparse matrix of type '<class 'numpy.float64'>' with 11728974 stored elements in Compressed Sparse Row format>` - However, I'm wondering if I've just nullified the data being used, as it's no longer related to the Year data? – Dbercules Mar 12 '18 at 17:18
  • 1
    Not sure if I'm following--you still have `df.year` as (2380,) array don't you? – Brad Solomon Mar 12 '18 at 17:25
  • 1
    Also, consider passing `stop_words='english'` to `TfidfVectorizer` to get # of features down. – Brad Solomon Mar 12 '18 at 17:26
  • You're right, apologies, I've been fiddling about with this problem (in some form or another) for literal days! I'd already applied `stop_words='english'` via Snowball Stemmer when I cleaned the data, but I implemented it in `TfidfVectorizer` as you said. I'm just processing the transformation again now - after which, I'll surge forward with the rest of the pipelining process. – Dbercules Mar 12 '18 at 17:31
  • Many thanks for your assistance thus far; however, when creating my X and y training/testing data sets (so that I can use them as sparse matrices for classification, etc.), calling `X = v.fit_transform(df['cleaned'])` operates exactly as intended (as illustrated above in this thread), whereas calling `fit_transform(X, df['Year'])` produces the same MemoryError. I've tried re-applying `df.dropna`, but to no avail. This memory error seems attributable only to the `y` value. – Dbercules Mar 13 '18 at 17:50
  • With regard to my previous comment, am I right in utilising the 'cleaned' column as the textual, vectorised data that is applied against the 'Year' column, which acts as the target? – Dbercules Mar 15 '18 at 12:14