
This is related to "How to append to the end of an empty list?", but I don't have enough reputation yet to comment there, so I posted a new question here.

I need to append terms onto an empty list of lists. I start with:

Talks[eachFilename][TermVectors]=
      [['paragraph','1','text'],
       ['paragraph','2','text'],
       ['paragraph','3','text']]

I want to end with

Talks[eachFilename][SomeTermsRemoved]=
      [['paragraph','text'],
       ['paragraph','2'],
       ['paragraph']]

Talks[eachFilename][SomeTermsRemoved] starts empty. I can't specify that I want:

Talks[eachFilename][SomeTermsRemoved][0][0]='paragraph'
Talks[eachFilename][SomeTermsRemoved][0][1]='text'
Talks[eachFilename][SomeTermsRemoved][1][0]='paragraph'

etc., because that raises IndexError: list index out of range. If I force-populate the list with a string and then try to change it, I get an error saying that strings are immutable.

So, how do I specify that I want Talks[eachFilename][SomeTermsRemoved][0] to be ['paragraph','text'], and Talks[eachFilename][SomeTermsRemoved][1] to be ['paragraph','2'] etc?

.append works, but it only produces a single long flat list of terms, not a list of lists.

To be more specific, I have a number of lists that are initialized inside a dict

Talks = {}
Talks[eachFilename]= {}
Talks[eachFilename]['StartingText']=[]
Talks[eachFilename]['TermVectors']=[]
Talks[eachFilename]['TermVectorsNoStops']=[]

eachFilename gets populated from a list of text files, e.g.:

Talks[eachFilename]=['filename1','filename2']

StartingText has several long lines of text (individual paragraphs)

Talks[filename1][StartingText]=['This is paragraph one','paragraph two']

TermVectors are populated by the NLTK package with a list of terms, still grouped in the original paragraphs:

Talks[filename1][TermVectors]=
     [['This','is','paragraph','one'],
      ['paragraph','two']]

I want to further manipulate the TermVectors, but keep the original paragraph list structure. This creates a list with 1 term per line:

for eachFilename in Talks:
    for eachTerm in range( 0, len( Talks[eachFilename]['TermVectors'] ) ):
        for term in Talks[eachFilename]['TermVectors'][ eachTerm ]:
            if unicode(term) not in stop_words:
                Talks[eachFilename]['TermVectorsNoStops'].append( term )

Result (I lose my paragraph structure):

Talks[filename1][TermVectorsNoStops]=
     [['This'],
      ['is'],
      ['paragraph'],
      ['one'],
      ['paragraph'],
      ['two']]
Edward

2 Answers


The errors you are reporting (strings immutable?) don't make any sense unless your list is actually not empty but already populated with strings. In any event, if you start with an empty list, then the simplest way to populate it is by appending:

>>> talks = {}
>>> talks['each_file_name'] = {}
>>> talks['each_file_name']['terms_removed'] = []
>>> talks['each_file_name']['terms_removed'].append(['paragraph','text'])
>>> talks['each_file_name']['terms_removed'].append(['paragraph','2'])
>>> talks['each_file_name']['terms_removed'].append(['paragraph'])
>>> talks
{'each_file_name': {'terms_removed': [['paragraph', 'text'], ['paragraph', '2'], ['paragraph']]}}
>>> from pprint import pprint
>>> pprint(talks)
{'each_file_name': {'terms_removed': [['paragraph', 'text'],
                                      ['paragraph', '2'],
                                      ['paragraph']]}}

If you have an empty list and try to assign to it by using indexing, it will throw an error:

>>> empty_list = []
>>> empty_list[0] = 10
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list assignment index out of range
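The same thing happens one level down, which is probably what the question ran into. Here is a minimal sketch, using a hypothetical `terms_removed` list that is not taken from the original code: reading `terms_removed[0]` fails before any assignment can happen, whereas appending an empty inner list first gives you something to fill.

```python
# Hypothetical name; stands in for Talks[eachFilename][SomeTermsRemoved].
terms_removed = []

try:
    # Python must first *read* terms_removed[0], which does not exist yet.
    terms_removed[0][0] = 'paragraph'
except IndexError as exc:
    error_message = str(exc)

# Create each inner list before filling it.
terms_removed.append([])
terms_removed[0].append('paragraph')
terms_removed[0].append('text')

# Or append a complete inner list in one step.
terms_removed.append(['paragraph', '2'])
```

Note that the failed read reports "list index out of range", while a direct assignment such as `empty_list[0] = 10` reports "list assignment index out of range".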

As an aside, code like this:

for eachFilename in Talks:
    for eachTerm in range( 0, len( Talks[eachFilename]['TermVectors'] ) ):
        for term in Talks[eachFilename]['TermVectors'][ eachTerm ]:
            if unicode(term) not in stop_words:
                Talks[eachFilename]['TermVectorsNoStops'].append( term )

is very far from proper Python style. Don't use camelCase; use snake_case. Don't capitalize variable names. Also, in your middle for-loop, you write for eachTerm in range(0, len(Talks[eachFilename]['TermVectors'])), but eachTerm is an int, so it makes more sense to use the conventional i, j, or k, or even idx.
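To illustrate the naming point only, here is a small sketch with made-up data (`term_vectors` is a stand-in, not your real structure). It contrasts an index-based loop that uses `i` with iterating over the paragraphs directly, plus `enumerate` for the cases where you really do need the index.

```python
# Stand-in data for one file's term vectors.
term_vectors = [['This', 'is', 'paragraph', 'one'], ['paragraph', 'two']]

# Index-based loop: works, and i makes it obvious the variable is an int.
by_index = []
for i in range(len(term_vectors)):
    by_index.append(len(term_vectors[i]))

# Direct iteration: no index needed, so none is created.
by_value = [len(paragraph) for paragraph in term_vectors]

# enumerate supplies the index alongside the item when both are needed.
numbered = [(i, paragraph[0]) for i, paragraph in enumerate(term_vectors)]
```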

Anyway, there is no reason why that code should be turning this:

Talks[filename1][TermVectors] =
     [['This','is','paragraph','one'],
      ['paragraph','two']] 

Into this:

Talks[filename1][TermVectors] =
     [['This'],
      ['is'],
      ['paragraph'],
      ['one'],
      ['paragraph'],
      ['two']]

Here is a reproducible example (I've made this for you, BUT YOU SHOULD DO THIS YOURSELF BEFORE POSTING A QUESTION):

>>> pprint(talks)
{'file1': {'no_stops': [],
           'term_vectors': [['This', 'is', 'paragraph', 'one'],
                            ['paragraph', 'two']]},
 'file2': {'no_stops': [],
           'term_vectors': [['This', 'is', 'paragraph', 'three'],
                            ['paragraph', 'four']]}}
>>> for file in talks:
...   for i in range(len(talks[file]['term_vectors'])):
...     for term in talks[file]['term_vectors'][i]:
...       if term not in stop_words:
...         talks[file]['no_stops'].append(term)
... 
>>> pprint(file)
'file2'
>>> pprint(talks)
{'file1': {'no_stops': ['This', 'paragraph', 'one', 'paragraph'],
           'term_vectors': [['This', 'is', 'paragraph', 'one'],
                            ['paragraph', 'two']]},
 'file2': {'no_stops': ['This', 'paragraph', 'paragraph', 'four'],
           'term_vectors': [['This', 'is', 'paragraph', 'three'],
                            ['paragraph', 'four']]}}
>>> 

The more pythonic approach would be something like the following:

>>> pprint(talks)
{'file1': {'no_stops': [],
           'term_vectors': [['This', 'is', 'paragraph', 'one'],
                            ['paragraph', 'two']]},
 'file2': {'no_stops': [],
           'term_vectors': [['This', 'is', 'paragraph', 'three'],
                            ['paragraph', 'four']]}}
>>> for file in talks.values():
...   file['no_stops'] = [[term for term in sub if term not in stop_words] for sub in file['term_vectors']]
... 
>>> pprint(talks)
{'file1': {'no_stops': [['This', 'paragraph', 'one'], ['paragraph']],
           'term_vectors': [['This', 'is', 'paragraph', 'one'],
                            ['paragraph', 'two']]},
 'file2': {'no_stops': [['This', 'paragraph'], ['paragraph', 'four']],
           'term_vectors': [['This', 'is', 'paragraph', 'three'],
                            ['paragraph', 'four']]}}
>>> 
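For anyone who wants to run this as a script rather than in the REPL, here is a self-contained sketch. The sessions above never show `stop_words`, so the set below is an assumption chosen to reproduce the printed output, and `remove_stop_words` is a hypothetical helper name of my own.

```python
# Assumed stop words; inferred from the output above, not from the question.
stop_words = {'is', 'two', 'three'}

talks = {
    'file1': {'no_stops': [],
              'term_vectors': [['This', 'is', 'paragraph', 'one'],
                               ['paragraph', 'two']]},
    'file2': {'no_stops': [],
              'term_vectors': [['This', 'is', 'paragraph', 'three'],
                               ['paragraph', 'four']]},
}


def remove_stop_words(paragraphs, stop_words):
    """Return a new list of lists, one inner list per paragraph."""
    return [[term for term in paragraph if term not in stop_words]
            for paragraph in paragraphs]


for entry in talks.values():
    # Rebinding the key replaces the empty list with the filtered structure.
    entry['no_stops'] = remove_stop_words(entry['term_vectors'], stop_words)
```

Using a set for `stop_words` keeps each membership test cheap, which matters once the stop list grows to a realistic size.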
juanpa.arrivillaga
  • Sorry @juanpa.arrivillaga, you are correct on the error for the empty list. I get an "IndexError: list index out of range" when the list is initially empty. I'll edit this above... – Edward Oct 16 '16 at 20:17

Some continued experimentation, along with the comments, got me moving towards a solution. Rather than appending each individual term, which generates a single long list, I accumulated the terms for each paragraph into a list and then appended that list, as follows:

for eachFilename in Talks:
    for eachTerm in range( 0, len( Talks[eachFilename]['TermVectors'] ) ):
        term_list = [ ]
        for term in Talks[eachFilename]['TermVectors'][ eachTerm ]:
            if unicode(term) not in stop_words:
                term_list.append(term)
        Talks[eachFilename]['TermVectorsNoStops'].append( term_list )
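For reference, a Python 3 sketch of the same accumulate-then-append idea with made-up data. The `stop_words` set and the sample `talks` dict are assumptions for illustration, and `str` stands in for Python 2's `unicode`. The important detail is that the inner list, not the last term, is what gets appended once per paragraph.

```python
# Assumed stop words and sample data, for illustration only.
stop_words = {'is', 'two'}
talks = {
    'filename1': {
        'TermVectors': [['This', 'is', 'paragraph', 'one'],
                        ['paragraph', 'two']],
        'TermVectorsNoStops': [],
    },
}

for filename in talks:
    for paragraph in talks[filename]['TermVectors']:
        term_list = []  # fresh inner list for each paragraph
        for term in paragraph:
            if str(term) not in stop_words:
                term_list.append(term)
        # Append the whole inner list so the paragraph grouping survives.
        talks[filename]['TermVectorsNoStops'].append(term_list)
```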

Thanks everyone!

Edward
  • Oh, so it's not `TermVectors` that loses its shape, but `TermVectorNoStops`... you said the opposite in your question. – juanpa.arrivillaga Oct 16 '16 at 20:34
  • This is correct. My apologies for the many inconsistencies. I tried to ask the question in a brief way to save people time, where I should have simply posted my code exactly. As I'm obviously new to python, I felt that my code would be hard to follow, and it would be easier to explain my question than post confusing non-pythonic code. Thanks again for the help! – Edward Oct 16 '16 at 23:01