I have the following data structure:
a= [
[u'happy', u'thursday', u'from', u'my', u'big', u'sweater', u'and', u'this',
u'ART', u'@', u'East', u'Village', u',', u'Manhattan', u'https',
u':', u'//t.co/5k8PUInmqK'],
[u'RT', u'@', u'MayorKev', u':', u'IM', u'SO', u'HYPEE', u'@', u'calloutband',
u'@', u'FreakLikeBex', u'#', u'Callout', u'#', u'TheBitterEnd', u'#',
u'Manhattan', u'#', u'Music', u'#', u'LiveMusic', u'#', u'NYC',
u'#', u'NY', u'#',
u'Jersey', u'#', u'NJ', u'http', u':', u'//t.co/0\u2026']
]
The way I see it, this is a list of lists of strings, except that it is wrapped in an extra pair of [ ] rather than ( ). That outer pair of [ ] is generated by:
a = [nltk.tokenize.word_tokenize(tweetL) for tweetL in tweetList]
Ultimately, I need to flatten this structure into a single list of strings so that I can run some regex and counting operations on the words, but the outer pair of [ ] is getting in the way.
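For reference, flattening one level of nesting can be sketched with itertools.chain.from_iterable; the shortened sample list below just stands in for the real tokenized data:

```python
from itertools import chain

# Shortened stand-in for the real two-level structure shown above.
a = [[u'happy', u'thursday'], [u'RT', u'@', u'MayorKev']]

# chain.from_iterable joins the sublists without touching the strings inside.
flat = list(chain.from_iterable(a))
print(flat)  # ['happy', 'thursday', 'RT', '@', 'MayorKev']
```

This only strips one level of [ ], so it assumes `a` really is a list of lists of strings.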
I tried to use:
list.extend()
and
ll = len(a)
for n in xrange(ll):
    print 'list - ', a[n], 'number = ', n
but I still get the same result:
list - [ number = 1
list - u number = 2
list - ' number = 3
list - h number = 4
list - a number = 5
list - p number = 6
list - p number = 7
As you can see, the code treats every character of the string as a list element, rather than treating each whole string as one element.
What is an efficient way to do this?
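One way to narrow this down: iterating over a string yields its individual characters, so character-by-character output usually means the items being iterated are strings, not lists. A quick diagnostic sketch (`first_elements` is an illustrative helper name of my own, not from any library):

```python
# Two possible shapes the data could actually have.
a_nested = [[u'happy', u'thursday']]   # list of lists: one level to strip
a_flat = [u'happy', u'thursday']       # already flat: nothing to strip

def first_elements(seq):
    """Report the type name of each top-level item."""
    return [type(item).__name__ for item in seq]

print(first_elements(a_nested))  # ['list'] -> flattening is needed
print(first_elements(a_flat))    # ['str', 'str'] -> flattening would split chars
```

If the top-level items are already strings, any further "flattening" will iterate the strings themselves and produce exactly the single-character output shown above.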
I also tried this:
flat_list = [i for sublist in a for i in sublist]
for i in flat_list:
    print 'element - ', i
result (partial):
element - h
element - a
element - p
element - p
element - y
element -
element - t
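A hedged sketch of a flatten helper that never splits strings into characters, whatever the nesting depth; `flatten` is my own illustrative name, not an nltk function (on Python 2 the isinstance check would use `basestring` to also cover unicode):

```python
def flatten(seq):
    """Yield whole strings from an arbitrarily nested list structure."""
    for item in seq:
        if isinstance(item, str):          # a whole token: emit it unsplit
            yield item
        else:                              # a sublist: recurse into it
            for token in flatten(item):
                yield token

# Works whether the strings sit one or several [ ] levels deep.
a = [[u'happy', [u'thursday']], u'RT']
print(list(flatten(a)))  # ['happy', 'thursday', 'RT']
```

Because the string case is checked first, this is safe to run even if the structure turns out to be already flat.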