0

I have 326 text documents stored as strings in a list. I wanted to split them into sentences so I could run them through trained machine learning models. Normally I work with sentences instead of documents, so I'm a bit lost. I started by splitting on periods (only_10ks_by_ticker is the name of the list):

for i, document in enumerate(only_10ks_by_ticker):
    only_10ks_by_ticker[i] = document.split('.')

It seems to have worked, and the list now contains one list per document, each holding that document's sentences as strings. But I cannot figure out how to apply a function to each sentence while retaining the list-of-lists structure. I can combine all the sentences of every document into one big list, but I want to know which sentences belong to which of the 326 documents. Here is what I tried (preprocess is the name of the function I want to apply to each sentence):

tokenized_10k_2_attempt3 = []

for i, document in enumerate(only_10ks_by_ticker):
    for sentence in document:
        tokenized_10k_2_attempt3.append(preprocess(sentence))

This works but puts all the sentences in one big list and thus loses the information of which sentences are in which documents. I also tried this:

tokenized_10k_2_attempt3 = []

for i, document in enumerate(only_10ks_by_ticker):
    for sentence in document:
        tokenized_10k_2_attempt3[i].append(preprocess(sentence))

But got an index error. Thanks for any help!

EDIT: I also tried just changing the original list:

for i, document in enumerate(only_10ks_by_ticker):

    for j, sentence in enumerate(document):
        only_10ks_by_ticker[i][j] = preprocess(sentence)

Still doesn't work.

EDIT 2: In case anyone ever needs this info, it turns out the solution is much, MUCH simpler than I realised. I just needed another set of brackets in the list comprehension to maintain the original structure:

tokenized_10k_22 = [[preprocess(sentence) for sentence in document] for document in only_10ks_by_ticker]

According to the tqdm library, running this took basically the exact same time as the function method given below, so I guess they work pretty much the same.

Thanks again to everyone, I learnt a lot about embedded data structures and how to deal with them :)

Nore Patel
  • The index error occurs because `tokenized_10k_2_attempt3` is an empty list, so there is no element at index `i` to append to – Andrex Nov 29 '19 at 19:51
  • How should I fill it so the list `tokenized_10k_2_attempt3` will be like the original list but with the function applied to each sentence? Thanks! – Nore Patel Nov 29 '19 at 19:56
  • 1
    Why don't you make a copy of `only_10ks_by_ticker`, say `only_10ks_by_ticker_copy`, and then do `only_10ks_by_ticker_copy = [function(elem) for elem in only_10ks_by_ticker_copy]`, where `function` is a function whose input is a list (a list of sentences)? Maybe I'm not understanding the whole problem – Andrex Nov 29 '19 at 19:59
  • Ok, let's say I don't mind modifying the original list instead of creating a new list. What is the code to do that? I tried the code in the edit above, but it still didn't work even if I don't make a new list. – Nore Patel Nov 29 '19 at 20:09
  • Did you see the answer I posted below? – Andrex Nov 29 '19 at 20:12
  • Yes, it worked :) I left a comment in your reply as well – Nore Patel Nov 29 '19 at 20:24
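
To illustrate the point made in the comments about the index error, here is a minimal sketch of pre-filling the result list with one empty sub-list per document before appending. The `preprocess` body and the sample data are hypothetical stand-ins, not the asker's real function or documents.

```python
def preprocess(sentence):
    # Hypothetical stand-in for the real preprocessing function.
    return sentence.strip().lower().split()

# Two tiny "documents", already split into sentences (made-up data).
only_10ks_by_ticker = [
    ["Revenue grew", " Costs fell"],
    ["Risk factors are listed below"],
]

# Start with one empty sub-list per document so that index i always exists.
tokenized = [[] for _ in only_10ks_by_ticker]

for i, document in enumerate(only_10ks_by_ticker):
    for sentence in document:
        tokenized[i].append(preprocess(sentence))

print(tokenized)
```

Appending to `tokenized[i]` only works because the outer list already has as many entries as there are documents; with a plain `[]` the first `tokenized[0]` lookup raises `IndexError`.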

3 Answers

1
def preprocess_document(document: list):
    document = [preprocess(sentence) for sentence in document]
    return document

tokenized_10k_2_attempt3 = [preprocess_document(document) for document in only_10ks_by_ticker]

Maybe I didn't understand the question?
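
As a rough check, the helper-function form and a nested list comprehension should produce the same list of lists. The sketch below assumes a stand-in `preprocess` and made-up sample data; only the structure matters.

```python
def preprocess(sentence):
    # Hypothetical stand-in for the real preprocessing function.
    return sentence.strip().lower()

def preprocess_document(document: list):
    return [preprocess(sentence) for sentence in document]

# Made-up sample: two documents with two and one sentences.
only_10ks_by_ticker = [["First sentence", " Second sentence"], ["Only sentence"]]

# Helper-function form.
via_function = [preprocess_document(document) for document in only_10ks_by_ticker]

# Equivalent nested list comprehension.
via_nested = [[preprocess(sentence) for sentence in document]
              for document in only_10ks_by_ticker]

print(via_function == via_nested)
print([len(document) for document in via_function])
```

Both forms keep one inner list per document, so the sentence counts per document match the input.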

Andrex
  • This worked! Thanks so much :) Could you explain when you use functions to do this instead of using loops? I have been trying countless embedded loops and list comprehensions to do this. It never occurred to me to build another function. Also what does the (document: list) argument in the preprocess_document function mean? I haven't seen a colon before in the arguments of a function. Thanks! – Nore Patel Nov 29 '19 at 20:22
  • @NorePatel I didn't write the answer, but I can still try to answer your questions! There are many things that functions can do that are impossible with just list comprehensions, so that's one scenario where the former is better than the latter. In this case, OP probably made a function for the sake of clarity and legibility. As for the colon, it indicates a [type annotation](https://stackoverflow.com/q/32557920/11301900). – AMC Nov 30 '19 at 04:51
  • Ah okay, so the type annotation/hint doesn't restrict the type of the argument document just gives a hint for it? In other words, if it was just `def preprocess_document(document)` it would still work just fine? Also, do you have any tuts or articles that might explain some of the things functions can do that for loops and list comps cannot? I feel like ever since I started programming, I know a lot of *stuff* but never know when to apply what lol. Thanks! – Nore Patel Nov 30 '19 at 10:38
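
On the type-hint question in the comments above: annotations are not checked by Python at runtime, so the function behaves the same with or without `: list`. A small sketch, using `str.upper` as a stand-in for the real `preprocess`:

```python
def preprocess_document(document: list):
    # The annotation documents the expected type; Python does not enforce it at runtime.
    # str.upper is a stand-in for the real preprocess function.
    return [sentence.upper() for sentence in document]

from_list = preprocess_document(["a b", "c d"])    # a list, as annotated
from_tuple = preprocess_document(("a b", "c d"))   # a tuple also runs without error

print(from_list)
print(from_tuple)
print(preprocess_document.__annotations__)
```

Static checkers such as mypy would flag the tuple call, but the interpreter itself runs it without complaint.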
0

You can get an element of a 2D (nested) list by indexing both dimensions: a[i][j]. In your case every sentence is accessible as only_10ks_by_ticker[i][j], where i and j are valid integer indexes.

For your example you can use two nested loops for that matter:

for doc in only_10ks_by_ticker:
    for sentence in doc:
        # Do anything with sentence here; it is a string.
        pass

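You can also use range for indexing, provided the result list already holds one empty list per document:

tokenized_10k_2_attempt = [[] for _ in only_10ks_by_ticker]  # one empty list per document
for i in range(len(only_10ks_by_ticker)):
    for sentence in only_10ks_by_ticker[i]:
        tokenized_10k_2_attempt[i].append(preprocess(sentence))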
Hamidreza
  • Yeah, that's what I did above (in the code snippets). I don't have an issue accessing the sentences, the issue is returning them back into the original form (with the sentences of each document separated from sentences from other documents). I want to apply the function to each sentence in the list of documents and have a new list that has the same structure as the old. – Nore Patel Nov 29 '19 at 19:58
  • @NorePatel I'm not sure I understand, but try the second set of nested loops I added in my edit. If that doesn't solve it, please try to explain your problem more clearly. – Hamidreza Nov 29 '19 at 20:12
0

Hopefully I've understood the structure of your list (see my example doc_list). My mymethod is the equivalent of your preprocess function.

def mymethod(text):
  # Stand-in for preprocess: upper-cases the sentence.
  return text.upper()


doc_list = [
  "document one. sentence one, ",
  "document two. sentence two",
  "document three. sentence three"
]
new_doc = ['.'.join([mymethod(sent) for sent in x.split('.')]) for x in doc_list]

import pprint
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(new_doc)

So, the code iterates through the list and splits each item on the period character. Each piece is passed to mymethod and collected into a new list; that list is then joined back into a single string using the same separator (a period) and placed in the final list.

This is my result:

[   'DOCUMENT ONE. SENTENCE ONE, ',
    'DOCUMENT TWO. SENTENCE TWO',
    'DOCUMENT THREE. SENTENCE THREE']

Hopefully that's what you need!

Eric Day
  • This would probably work too. I just realised you can have multiple brackets inside a list comprehension to maintain the original structure. One question though: why is there a '.' before the .join() method instead of just a space like this ' '? That's what I typically see with the join method. Thanks! – Nore Patel Nov 30 '19 at 10:42
  • Hey, thanks for this response, it helped me realise I was just missing a set of brackets in my original list comp and it worked fine (needed another set of brackets around the for sentence in document part, you can check what I mean in my original post in Edit 2). Thanks again! :) – Nore Patel Nov 30 '19 at 12:03
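
On the question above about `'.'` versus `' '` as the join separator: `split('.')` removes the periods, so joining with `'.'` puts them back and reproduces the original string, whereas joining with a space does not. A minimal sketch with a made-up string:

```python
text = "document one. sentence one"

# split('.') drops the separator and keeps the space that followed it.
parts = text.split('.')
print(parts)

# Joining with the same separator restores the original string.
rejoined_with_period = '.'.join(parts)

# Joining with a space loses the period.
rejoined_with_space = ' '.join(parts)

print(rejoined_with_period == text)
print(rejoined_with_space == text)
```

The string before `.join()` is simply whatever should go between the pieces, so it is chosen to match the separator that was used for splitting.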