I have 326 text documents stored in a Python list, one string per document. I want to split them into sentences so I can run them through trained machine learning models. Normally I work with sentences rather than whole documents, so I'm a bit lost. I started by splitting on periods (only_10ks_by_ticker is the name of the list):
for i, document in enumerate(only_10ks_by_ticker):
    only_10ks_by_ticker[i] = document.split('.')
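As a sanity check on toy data (two made-up two-sentence documents standing in for the real list), the split gives one inner list of sentence strings per document:

docs = ["First sentence. Second sentence.", "Third sentence. Fourth sentence."]
for i, document in enumerate(docs):
    docs[i] = document.split('.')
print(docs)
# [['First sentence', ' Second sentence', ''], ['Third sentence', ' Fourth sentence', '']]
# (note the trailing '' produced by the final period of each document)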
It seems to have worked, and the list now contains one inner list of sentence strings per document. But I cannot figure out how to apply a function to each sentence while retaining the list-of-lists structure. I can combine all the sentences of every document into one big list, but I need to know which sentences belong to which of the 326 documents. Here is what I tried (preprocess is the name of the function I want to apply to each sentence):
tokenized_10k_2_attempt3 = []
for i, document in enumerate(only_10ks_by_ticker):
    for sentence in document:
        tokenized_10k_2_attempt3.append(preprocess(sentence))
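To make the problem concrete, here is the same pattern on toy data (with str.lower standing in for preprocess):

docs = [['Sentence one', ' Sentence two'], ['Sentence three', ' Sentence four']]
flat = []
for i, document in enumerate(docs):
    for sentence in document:
        flat.append(sentence.lower())
print(flat)
# ['sentence one', ' sentence two', 'sentence three', ' sentence four']
# one flat list of four sentences; the document boundaries are gone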
As the toy run shows, this works but puts all the sentences into one flat list, so I lose the information about which sentences came from which document. I also tried this:
tokenized_10k_2_attempt3 = []
for i, document in enumerate(only_10ks_by_ticker):
    for sentence in document:
        tokenized_10k_2_attempt3[i].append(preprocess(sentence))
But I got an IndexError. Thanks for any help!
EDIT: I also tried just changing the original list:
for i, document in enumerate(only_10ks_by_ticker):
    for j, sentence in enumerate(document):
        only_10ks_by_ticker[i][j] = preprocess(sentence)
Still doesn't work.
EDIT 2: In case anyone ever needs this info, it turns out the solution is much, MUCH simpler than I realised. I just needed another set of brackets in the list comprehension to maintain the original structure:
tokenized_10k_22 = [[preprocess(sentence) for sentence in document] for document in only_10ks_by_ticker]
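A quick toy check (again with str.lower standing in for preprocess) confirms the nested structure is kept:

docs = [['Sentence one', ' Sentence two'], ['Sentence three', ' Sentence four']]
nested = [[sentence.lower() for sentence in document] for document in docs]
print(nested)
# [['sentence one', ' sentence two'], ['sentence three', ' sentence four']]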
According to tqdm, running this took essentially the same time as the function-based method given below, so I guess they perform about the same.
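For the record, the loop version that gave me the IndexError also works once an empty inner list is appended for each document (the error happened because tokenized_10k_2_attempt3[i] didn't exist yet):

tokenized_10k_2_attempt3 = []
for i, document in enumerate(only_10ks_by_ticker):
    tokenized_10k_2_attempt3.append([])  # create the inner list for this document first
    for sentence in document:
        tokenized_10k_2_attempt3[i].append(preprocess(sentence))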
Thanks again to everyone, I learnt a lot about nested data structures and how to deal with them :)