
I have the following data structure:

 a = [
       [u'happy', u'thursday', u'from', u'my', u'big', u'sweater', u'and', u'this', 
        u'ART', u'@', u'East', u'Village', u',', u'Manhattan', u'https', 
        u':', u'//t.co/5k8PUInmqK'],
       [u'RT', u'@', u'MayorKev', u':', u'IM', u'SO', u'HYPEE', u'@', u'calloutband', 
        u'@', u'FreakLikeBex', u'#', u'Callout', u'#', u'TheBitterEnd', u'#',
        u'Manhattan', u'#', u'Music', u'#', u'LiveMusic', u'#', u'NYC', 
         u'#', u'NY', u'#',
        u'Jersey', u'#', u'NJ', u'http', u':', u'//t.co/0\u2026']
     ]

The way I see it, this is a list of lists of strings, except that it is wrapped in an outer pair of [ ] rather than ( ). That outer pair of [ ] is generated by:

a = [nltk.tokenize.word_tokenize(tweetL) for tweetL in tweetList]

Ultimately, I need to flatten this structure into a single list of strings and run some regex and counting operations on the words, but the outer pair of [ ] is preventing this.

I tried to use:

list.extend()

and

ll = len(a)
for n in xrange(ll):
    print 'list - ', a[n], 'number = ', n

but still get the same result:

list - [ number =  1
list - u number =  2
list - ' number =  3
list - h number =  4
list - a number =  5
list - p number =  6
list - p number =  7

As you can see, the code treats every character of the string as a list element rather than treating each whole string as an element.
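
For comparison, iterating over a genuine list of lists prints whole sublists rather than single characters (a minimal sketch):

b = [[u'happy', u'thursday'], [u'RT', u'@']]
for n in xrange(len(b)):
    print 'list - ', b[n], 'number = ', n
# list -  [u'happy', u'thursday'] number =  0
# list -  [u'RT', u'@'] number =  1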

What can be done to flatten this efficiently?

I also tried this:

flat_list = [i for sublist in a for i in sublist] 
for i in flat_list:
    print 'element - ', i

result (partial):

element -  h
element -  a
element -  p
element -  p
element -  y
element -   
element -  t
Toly
  • I think there is a line in your code that is casting a as a string rather than the list of lists. It's not a problem with the extra bracket – R Nar Oct 15 '15 at 23:23
  • your output also doesn't seem right, did you have another line that said `list - [ number = 0`? – zehnpaard Oct 15 '15 at 23:26
  • Possible duplicate of [Making a flat list out of list of lists in Python](http://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python) – TigerhawkT3 Oct 15 '15 at 23:28
  • @zehnpaard - no I do not have that line you mention – Toly Oct 15 '15 at 23:35
  • @Shashank - that certainly removed the enveloping pair of [ ] ! But now I am getting UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 8: ordinal not in range(128) after wordTokenLw = ' '.join(map(str, wordToken)). Use encode somewhere? – Toly Oct 15 '15 at 23:38

4 Answers


I am not sure I quite understand your question (let me know if I am way off), but based on the input you provided, you have a list of lists. If that is the structure you always have, you can just take out what you need with:

a = a[0]

That would simply give you a single list.

Then you can simply iterate over it:

for i in a:
    print(i)

However, if that is just a sample, and you in fact have something like this:

[[],[],[],[]]

and you want to completely flatten that into a single list, then the comprehension you want is:

flat_list = [i for sublist in a for i in sublist] 

Then you have a single flat list, e.g. [1, 2, 3, 4].

Then you simply iterate over what you want:

for i in flat_list:
    print(i)

Alternatively, if you want to print out the index as well, you can do this:

for i, v in enumerate(flat_list):
    print("{}: {}".format(i, v))

Just a final comment about your usage of extend.

extend, as the help for the method states:

extend(...)
    L.extend(iterable) -- extend list by appending elements from the iterable

So, its usage "extends" the list, as in this example:

a = [1, 2, 3]
b = [4, 5, 6]
a.extend(b)
# a will now be [1, 2, 3, 4, 5, 6]
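
If you did want to flatten with extend, the idea is to extend an empty accumulator list with each sublist in turn (a minimal sketch, equivalent to the comprehension above):

nested = [[1, 2], [3, 4]]
flat_list = []
for sublist in nested:
    flat_list.extend(sublist)   # appends 1, 2, then 3, 4 -- not the sublists themselves
# flat_list is now [1, 2, 3, 4]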

Running your input:

a = [[u'happy', u'thursday', u'from', u'my', u'big', u'sweater', u'and', u'this', u'ART', u'@', u'East', u'Village', u',', u'Manhattan', u'https', u':', u'//t.co/5k8PUInmqK'], [u'RT', u'@', u'MayorKev', u':', u'IM', u'SO', u'HYPEE', u'@', u'calloutband', u'@', u'FreakLikeBex', u'#', u'Callout', u'#', u'TheBitterEnd', u'#', u'Manhattan', u'#', u'Music', u'#', u'LiveMusic', u'#', u'NYC', u'#', u'NY', u'#', u'Jersey', u'#', u'NJ', u'http', u':', u'//t.co/0\u2026']]

through my code yields this output:

0: happy
1: thursday
2: from
3: my
4: big
5: sweater
6: and
7: this
8: ART
9: @
10: East
11: Village
12: ,
13: Manhattan
14: https
15: :
16: //t.co/5k8PUInmqK
idjaw
  • Make that first line `a = a[0]` – Prune Oct 15 '15 at 23:23
  • @idjaw - unfortunately I tried that and some other means already. It still returns the single character at a time for my data structure (rather than a "word"). this is what I have when I tried your solution above: h a p p as a column – Toly Oct 15 '15 at 23:31
  • @Toly - It sounds like your data structure isn't what you think it is, then. – TigerhawkT3 Oct 15 '15 at 23:33
  • @Toly, TigerhawkT3 is right. Can you please validate what your data structure looks like? Because it seems like you are iterating over a string with the output you are seeing on your side. – idjaw Oct 15 '15 at 23:35
  • @idjaw - you are correct. I am very puzzled at the output of a tokenizer (of course, I am very new to this area!) The data structure is exactly what I present in the body of the question with the generating code line – Toly Oct 15 '15 at 23:42
  • @Toly. Do this for me: run this command `flat_list = [i for sublist in a for i in sublist]` and update your post with exactly what is in flat_list. Also, do a print(type(flat_list)) and tell me what you get there as well. – idjaw Oct 15 '15 at 23:44
  • @Toly How is the troubleshooting going. Did you get it working? – idjaw Oct 16 '15 at 00:15
  • @idjaw - I found an error of my own. The "list of lists" [[ blah blah],[khh khh]] became [blah blah], [khh khh] after wordTokenLw = ','.join(map(str, wordToken)) wordTokenLw = wordTokenLw.lower(). However, I still cannot get what I need even if I use wordTokenLw[1]; it still gives only 1 character. I am very sorry for all this mess but it has been driving me nuts for some time :) – Toly Oct 16 '15 at 00:26
a= [[u'happy', u'thursday', u'from', u'my', u'big', u'sweater', u'and', u'this', u'ART', u'@', u'East', u'Village', u',', u'Manhattan', u'https', u':', u'//t.co/5k8PUInmqK'], [u'RT', u'@', u'MayorKev', u':', u'IM', u'SO', u'HYPEE', u'@', u'calloutband', u'@', u'FreakLikeBex', u'#', u'Callout', u'#', u'TheBitterEnd', u'#', u'Manhattan', u'#', u'Music', u'#', u'LiveMusic', u'#', u'NYC', u'#', u'NY', u'#', u'Jersey', u'#', u'NJ', u'http', u':', u'//t.co/0\u2026']]

from itertools import chain

flat_a = list(chain.from_iterable(a))
print(flat_a)

['happy', 'thursday', 'from', 'my', 'big', 'sweater', 'and', 'this', 'ART', '@', 'East', 'Village', ',', 'Manhattan', 'https', ':', '//t.co/5k8PUInmqK', 'RT', '@', 'MayorKev', ':', 'IM', 'SO', 'HYPEE', '@', 'calloutband', '@', 'FreakLikeBex', '#', 'Callout', '#', 'TheBitterEnd', '#', 'Manhattan', '#', 'Music', '#', 'LiveMusic', '#', 'NYC', '#', 'NY', '#', 'Jersey', '#', 'NJ', 'http', ':', '//t.co/0…']
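
An equivalent spelling unpacks the sublists as separate arguments to chain, giving the same result for a one-level-nested list (a minimal alternative sketch):

flat_a = list(chain(*a))   # same chain import and a as above; each sublist becomes its own argument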
LetzerWille
  • regretfully, I still have the same issue ['[', 'u', "'", 'h', 'a', 'p', 'p', 'y', "'", ',', ' ', 'u', "'", 't', 'h', 'u', 'r', 's', 'd', 'a', 'y', "'", ',', ' ', 'u', "'", 'f', 'r', 'o', 'm', "'", ',', ' ', 'u', "'", 'm', 'y', "'", ',', ' ', 'u', "'", 'b', 'i', 'g', "'", ',', ' ', 'u', "'", 's', 'w', 'e', 'a', as an output. I am completely puzzled why it does not work. I use Python 2.7, just in case – Toly Oct 15 '15 at 23:56
  • So in 2.7 it did not even flatten the list. Strange, it did for me in Python 3. Try running this: flat_a = list(chain(*a)) – LetzerWille Oct 16 '15 at 00:01
  • I take it back. It does work when there is an "envelope" by [ ] !! The issue was that the command before wordTokenLw = ' ,'.join(map(str, wordToken)) removed the envelope of [ ] and now it looks like [u'happy', u'thursday', u'from', u'my',], [u'big', u'sweater', u'and', u'this', u'art', u'@', u]. Now need to figure out how to do your operation on this structure. tried wordTokenLw[1], wordTokenLw[2] but got only u '. Sorry for my mistake!! – Toly Oct 16 '15 at 00:09
a= [[u'happy', u'thursday', u'from', u'my', u'big', u'sweater', u'and', u'this', u'ART', u'@', u'East', u'Village', u',', u'Manhattan', u'https', u':', u'//t.co/5k8PUInmqK'], [u'RT', u'@', u'MayorKev', u':', u'IM', u'SO', u'HYPEE', u'@', u'calloutband', u'@', u'FreakLikeBex', u'#', u'Callout', u'#', u'TheBitterEnd', u'#', u'Manhattan', u'#', u'Music', u'#', u'LiveMusic', u'#', u'NYC', u'#', u'NY', u'#', u'Jersey', u'#', u'NJ', u'http', u':', u'//t.co/0\u2026']]
for L in a:
    for e in L:
        print "element "+e


element happy
element thursday
element from
element my
element big
element sweater
element and
element this
element ART
element @
element East
skalp

A nested list comprehension should solve your first problem.

a = [token for tweetL in tweetList for token in nltk.tokenize.word_tokenize(tweetL)]

This construct lets you iterate over the elements produced by nested for loops. The outermost for loop always comes first, then the next one in, and so on, with the innermost for loop last.

It might help to understand that this is equivalent to:

a = []
for tweetL in tweetList:
    for token in nltk.tokenize.word_tokenize(tweetL):
        a.append(token)
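
For example, with a small hypothetical two-tweet tweetList (and assuming nltk and its tokenizer data are installed), either form gives you one flat list of unicode tokens:

import nltk

# hypothetical sample input; your real tweetList holds the full tweet strings
tweetList = [u'happy thursday from my big sweater',
             u'RT @MayorKev: IM SO HYPEE #Manhattan']
a = [token for tweetL in tweetList
     for token in nltk.tokenize.word_tokenize(tweetL)]
print(a)   # one flat list of tokens from both tweets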

In Python 2, you can encode unicode strings as UTF-8. This converts them from the unicode type to the str type, which should solve the UnicodeEncodeError.

Example:

u'\u2713'.encode('utf-8')

For more information on Python 2 Unicode, you can read here: https://docs.python.org/2/howto/unicode.html
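
To illustrate, the error you saw comes from str() implicitly encoding with the ascii codec, which fails on non-ASCII characters such as u'\u2026'; calling .encode('utf-8') explicitly avoids that (a minimal Python 2 sketch):

token = u'//t.co/0\u2026'      # ends with the ellipsis character from your data

print token.encode('utf-8')    # works: explicit utf-8 encoding
try:
    str(token)                 # implicit ascii encoding
except UnicodeEncodeError as e:
    print 'str() failed:', e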

Shashank
  • Thank you! Can I include unicoding into your nested statement? I do not really want to print it. The end result I am looking for is to get everything into a list of strings. My intent is to use these strings with regex (to clean-up the strings) getting ultimately some statistics. – Toly Oct 16 '15 at 01:58
  • @Toly Yes, any valid Python expression can be used on the left-most part of a list comprehension, so `token.encode('utf-8')` could easily replace `token` like so: `a = [token.encode('utf-8') for tweetL in tweetList for token in nltk.tokenize.word_tokenize(tweetL)]` – Shashank Oct 16 '15 at 02:05
  • Wow!! Absolutely Great!! Solved all my issues! Thank you Shashank and everyone who was helping me! Very much appreciated and learned a lot! – Toly Oct 16 '15 at 02:10