I'm struggling to convert an array into individual tokens. I'm currently using the following code, but it doesn't give me the exact output I want: I would like each token to be paired with its number as well.

text = df.head(3)[['processed_arti', 'cluster']].values    # where df is a pandas DataFrame

terms = [b for l in text for b in zip (l[0].split(" "))]

[image]

I've added another picture below showing in more detail how the data looks once read into a pandas DataFrame.

[image]

I'd really appreciate any help on this. Thanks in advance.

ALK
  • Could you please provide an MRE? stackoverflow.com/help/minimal-reproducible-example – Rafael Valero Feb 09 '21 at 18:18
  • `terms = [b for l in text for b in itertools.product(l[0].split(" "), l[1])]`? (with `import itertools`) – Epsi95 Feb 09 '21 at 18:20
  • Thanks. Could you please provide a sample in Python? – Rafael Valero Feb 09 '21 at 18:28
  • Thank you @RafaelValero for your responses. I've added a few more details to the question above. Thank you. – ALK Feb 09 '21 at 18:33
  • Thank you @Epsi95 for your response. I get the following error when I try itertools - "TypeError: 'int' object is not iterable" – ALK Feb 09 '21 at 18:34
  • @ALK, could you please copy and paste the code instead of posting pictures? With pictures, people have to retype code that you already have. – Rafael Valero Feb 09 '21 at 18:36
  • Thank you @RafaelValero for the recommendation and the help. I am sorted now. – ALK Feb 09 '21 at 18:44
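
The `TypeError` reported in the comments is consistent with `l[1]` being a single integer, since `itertools.product` expects every argument to be iterable. Below is a minimal sketch of one possible fix, wrapping the number in a one-element list. The sample rows are hypothetical and stand in for the `.values` array from the question.

```python
import itertools

# Hypothetical rows standing in for df[['processed_arti', 'cluster']].values.
text = [["apple banana", 3], ["cat dog", 7]]

# itertools.product(words, 3) would raise TypeError because 3 is not iterable.
# Wrapping the number as [3] gives product a one-element iterable to pair with each word.
terms = [b for l in text for b in itertools.product(l[0].split(" "), [l[1]])]
print(terms)
# [('apple', 3), ('banana', 3), ('cat', 7), ('dog', 7)]
```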

2 Answers

Isn't this what you need? You just need to add the number alongside your words:

terms = [(b, n) for l, n in text for b in l.split(" ")]
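
A minimal, self-contained sketch of how this comprehension behaves. A plain list of rows stands in for the `.values` array so the snippet runs without pandas, and the strings and numbers are made up for illustration.

```python
# Hypothetical rows standing in for df[['processed_arti', 'cluster']].values.
text = [
    ["apple banana", 3],
    ["cat dog", 7],
    ["red", 1],
]

# Each row unpacks into (string, number); every word is paired with its row's number.
terms = [(b, n) for l, n in text for b in l.split(" ")]
print(terms)
# [('apple', 3), ('banana', 3), ('cat', 7), ('dog', 7), ('red', 1)]
```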
Yevhen Kuzmovych

First, build a list of lists containing your tuples:

[[(word, l[1]) for word in l[0].split(' ')] for l in a] # a being your array.

Then flatten the list of lists; see "How to make a flat list out of list of lists?"
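
One way to do the flattening step is `itertools.chain.from_iterable` from the standard library. This is a sketch only: the rows in `a` are hypothetical, and the strings are split on a space as in the question.

```python
from itertools import chain

# Hypothetical rows standing in for the array from the question.
a = [["apple banana", 3], ["cat dog", 7]]

# Step 1: one inner list of (word, number) tuples per row.
nested = [[(word, l[1]) for word in l[0].split(" ")] for l in a]

# Step 2: chain the inner lists into a single flat list.
flat = list(chain.from_iterable(nested))
print(flat)
# [('apple', 3), ('banana', 3), ('cat', 7), ('dog', 7)]
```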

Or better, as Yevhen Kuzmovych suggested:

[(word, l[1]) for l in a for word in l[0].split(' ')]

Note: Not verified. Typed on my mobile.

Tarik