
How can I convert the values in tuples from unicode to str when the tuples are inside a list of lists, in Python 2.7?

Here is an example:

From

 unicode_list=[[(u'Mr.', u'UNK'), (u'Vinken', u'UNK'), (u'is', u'UNK'), (u'chairman', u'UNK'), (u'of', u'UNK'), (u'Elsevier', u'UNK'), (u'N.V.', u'UNK'), (u',', u'UNK'), (u'the', u'UNK'), (u'Dutch', u'UNK'), (u'publishing', u'UNK'), (u'group', u'UNK'), (u'.', u'UNK')], [(u'Rudolph', u'UNK'), (u'Agnew', u'UNK'), (u',', u'UNK'), (u'55', u'UNK'), (u'years', u'UNK'), (u'old', u'UNK'), (u'and', u'UNK'), (u'former', u'UNK'), (u'chairman', u'UNK'), (u'of', u'UNK'), (u'Consolidated', u'UNK'), (u'Gold', u'UNK'), (u'Fields', u'UNK'), (u'PLC', u'UNK'), (u',', u'UNK'), (u'was', u'UNK'), (u'named', u'UNK'), (u'*-1', u'UNK'), (u'a', u'UNK'), (u'nonexecutive', u'UNK'), (u'director', u'UNK'), (u'of', u'UNK'), (u'this', u'UNK'), (u'British', u'UNK'), (u'industrial', u'UNK'), (u'conglomerate', u'UNK'), (u'.', u'UNK')], [(u'A', u'UNK'), (u'form', u'UNK'), (u'of', u'UNK'), (u'asbestos', u'UNK'), (u'once', u'UNK'), (u'used', u'UNK'), (u'*', u'UNK'), (u'*', u'UNK'), (u'to', u'UNK'), (u'make', u'UNK'), (u'Kent', u'UNK'), (u'cigarette', u'UNK'), (u'filters', u'UNK'), (u'has', u'UNK'), (u'caused', u'UNK'), (u'a', u'UNK'), (u'high', u'UNK'), (u'percentage', u'UNK'), (u'of', u'UNK'), (u'cancer', u'UNK'), (u'deaths', u'UNK'), (u'among', u'UNK'), (u'a', u'UNK'), (u'group', u'UNK'), (u'of', u'UNK'), (u'workers', u'UNK'), (u'exposed', u'UNK'), (u'*', u'UNK'), (u'to', u'UNK'), (u'it', u'UNK'), (u'more', u'UNK'), (u'than', u'UNK'), (u'30', u'UNK'), (u'years', u'UNK'), (u'ago', u'UNK'), (u',', u'UNK'), (u'researchers', u'UNK'), (u'reported', u'UNK'), (u'0', u'UNK'), (u'*T*-1', u'UNK'), (u'.', u'UNK')], [(u'The', u'UNK'), (u'asbestos', u'UNK'), (u'fiber', u'UNK'), (u',', u'UNK'), (u'crocidolite', u'UNK'), (u',', u'UNK'), (u'is', u'UNK'), (u'unusually', u'UNK'), (u'resilient', u'UNK'), (u'once', u'UNK'), (u'it', u'UNK'), (u'enters', u'UNK'), (u'the', u'UNK'), (u'lungs', u'UNK'), (u',', u'UNK'), (u'with', u'UNK'), (u'even', u'UNK'), (u'brief', u'UNK'), (u'exposures', u'UNK'), (u'to', u'UNK'), (u'it', 
u'UNK'), (u'causing', u'UNK'), (u'symptoms', u'UNK'), (u'that', u'UNK'), (u'*T*-1', u'UNK'), (u'show', u'UNK'), (u'up', u'UNK'), (u'decades', u'UNK'), (u'later', u'UNK'), (u',', u'UNK'), (u'researchers', u'UNK'), (u'said', u'UNK'), (u'0', u'UNK'), (u'*T*-2', u'UNK'), (u'.', u'UNK')]]

to

 ascii_list=[[('Mr.', 'NOUN'), ('Vinken', 'NOUN'), ('is', 'VERB'), ('chairman', 'NOUN'), ('of', 'ADP'), ('Elsevier', 'NOUN'), ('N.V.', 'NOUN'), (',', '.'), ('the', 'DET'), ('Dutch', 'NOUN'), ('publishing', 'VERB'), ('group', 'NOUN'), ('.', '.')], [('Rudolph', 'NOUN'), ('Agnew', 'NOUN'), (',', '.'), ('55', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), ('and', 'CONJ'), ('former', 'ADJ'), ('chairman', 'NOUN'), ('of', 'ADP'), ('Consolidated', 'NOUN'), ('Gold', 'NOUN'), ('Fields', 'NOUN'), ('PLC', 'NOUN'), (',', '.'), ('was', 'VERB'), ('named', 'VERB'), ('*-1', 'X'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('of', 'ADP'), ('this', 'DET'), ('British', 'ADJ'), ('industrial', 'ADJ'), ('conglomerate', 'NOUN'), ('.', '.')], [('A', 'DET'), ('form', 'NOUN'), ('of', 'ADP'), ('asbestos', 'NOUN'), ('once', 'ADV'), ('used', 'VERB'), ('*', 'X'), ('*', 'X'), ('to', 'PRT'), ('make', 'VERB'), ('Kent', 'NOUN'), ('cigarette', 'NOUN'), ('filters', 'NOUN'), ('has', 'VERB'), ('caused', 'VERB'), ('a', 'DET'), ('high', 'ADJ'), ('percentage', 'NOUN'), ('of', 'ADP'), ('cancer', 'NOUN'), ('deaths', 'NOUN'), ('among', 'ADP'), ('a', 'DET'), ('group', 'NOUN'), ('of', 'ADP'), ('workers', 'NOUN'), ('exposed', 'VERB'), ('*', 'X'), ('to', 'PRT'), ('it', 'PRON'), ('more', 'ADV'), ('than', 'ADP'), ('30', 'NUM'), ('years', 'NOUN'), ('ago', 'ADP'), (',', '.'), ('researchers', 'NOUN'), ('reported', 'VERB'), ('0', 'X'), ('*T*-1', 'X'), ('.', '.')], [('The', 'DET'), ('asbestos', 'NOUN'), ('fiber', 'NOUN'), (',', '.'), ('crocidolite', 'NOUN'), (',', '.'), ('is', 'VERB'), ('unusually', 'ADV'), ('resilient', 'ADJ'), ('once', 'ADP'), ('it', 'PRON'), ('enters', 'VERB'), ('the', 'DET'), ('lungs', 'NOUN'), (',', '.'), ('with', 'ADP'), ('even', 'ADV'), ('brief', 'ADJ'), ('exposures', 'NOUN'), ('to', 'PRT'), ('it', 'PRON'), ('causing', 'VERB'), ('symptoms', 'NOUN'), ('that', 'DET'), ('*T*-1', 'X'), ('show', 'VERB'), ('up', 'PRT'), ('decades', 'NOUN'), ('later', 'ADJ'), (',', '.'), ('researchers', 
'NOUN'), ('said', 'VERB'), ('0', 'X'), ('*T*-2', 'X'), ('.', '.')]]
snakecharmerb
  • Possible duplicate of https://stackoverflow.com/questions/1207457/convert-a-unicode-string-to-a-string-in-python-containing-extra-symbols – ForceBru Nov 08 '19 at 19:11
  • @ForceBru [stackoverflow.com/questions/1207457/…](https://stackoverflow.com/questions/1207457/convert-a-unicode-string-to-a-string-in-python-containing-extra-symbols) talks about converting a Unicode string to string. But here it is a case of a nested list. – Javed Anwar Nov 08 '19 at 19:20
  • But this case can be reduced to converting Unicode to string: simply walk the data structure and convert each element to a string. – ForceBru Nov 08 '19 at 19:34

1 Answer


You can do the conversion with a nested list comprehension:

>>> ascii_list  = [[tuple([str(x) for x in tpl]) 
...  for tpl in sublist]
...  for sublist in unicode_list]

This creates a new list of lists. The original tuples can't be modified in place, because tuples are immutable.
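As a self-contained sketch of the same idea, the comprehension can be wrapped in a small function and run on a shortened sample. The helper name `to_str_tuples` and the two-sentence sample are assumptions made for illustration; they are not part of the question. Note that `str(x)` only works here because the data is pure ASCII: in Python 2, calling `str()` on a unicode value containing non-ASCII characters raises `UnicodeEncodeError`.

```python
# Shortened sample in the same shape as the question's data:
# a list of lists of (word, tag) tuples.
unicode_list = [
    [(u'Mr.', u'UNK'), (u'Vinken', u'UNK')],
    [(u'Rudolph', u'UNK'), (u'Agnew', u'UNK')],
]


def to_str_tuples(nested):
    """Return a new list of lists whose tuples hold str values.

    to_str_tuples is a hypothetical helper name, not a library function.
    Assumes every value is ASCII-only, so str() can convert it safely.
    """
    return [[tuple(str(x) for x in tpl) for tpl in sublist]
            for sublist in nested]


ascii_list = to_str_tuples(unicode_list)
print(ascii_list)
```

The input is left untouched; only the returned structure contains `str` values. The tags are carried over unchanged, so `u'UNK'` becomes `'UNK'`.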

snakecharmerb
  • I don't really understand why you'd bother doing this though - in Python 2 unicode and str are pretty much interchangeable if you're only handling ASCII. – snakecharmerb Nov 08 '19 at 19:35
  • There is still one bug remaining: it converts all the second values to 'UNK', e.g. `ascii_list =[[('Mr.', 'UNK'), ('Vinken', 'UNK'), ('is', 'UNK'), ('chairman', 'UNK'), ('of', 'UNK'), ('Elsevier', 'UNK'), ('N.V.', 'UNK'), (',', 'UNK'), ('the', 'UNK'), ('Dutch', 'UNK'), ('publishing', 'UNK'), ('group', 'UNK'), ('.', 'UNK')]]` – Javed Anwar Nov 08 '19 at 20:10
  • No, you were still using the original `unicode_list` from your question. It works on the new input that you edited in afterwards too (although clearly it can't convert UNK to NOUN or VERB without additional logic). Please don't change question requirements _after_ the question has been answered - ask a new question instead. – snakecharmerb Nov 09 '19 at 09:00