
I am quite a noob at Python and programming in general.

I am trying to use some NLTK tools for my dissertation in Applied Linguistics, but something keeps the nltk tools from working on my dataset.

I've tried some code in the copy+paste+modify style, but had no success. How should I prepare my dataset in order to apply nltk to it (for example, finding the percentage of punctuation in each sentence, counting/eliminating stopwords, etc.)? I've applied those features to another dataset, which consists of plain texts, not enclosed in any of these "['']".

ds = {0: "['sentences I need to parse.']", 
      1: "['word1', 'word2', 'word3']",
      2: "['sentences and words']",
      3: "['Natural language processing.']",
      4: "['Further tokenization is needed.']",
      5: "['Is it a question?']",
      6: "['You\'re a real noob.']"}

The output I am trying to obtain is:

sentences I need to parse
word1, word2, word3
sentences and words
Natural language processing.
Further tokenization is needed.
Is it a question?
You're a real noob.
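
For the NLTK part of the question (punctuation percentage per sentence, counting/removing stopwords), a rough sketch along these lines might work once the texts are unwrapped; it assumes the `texts` dict produced by the sketch above and uses the standard `punkt` and `stopwords` resources:

import string
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')        # word/sentence tokenizer models
nltk.download('stopwords')    # stopword lists

stop_words = set(stopwords.words('english'))

for key, text in texts.items():
    tokens = nltk.word_tokenize(text)
    punctuation = [t for t in tokens if t in string.punctuation]
    content = [t for t in tokens if t.lower() not in stop_words
               and t not in string.punctuation]
    share = 100 * len(punctuation) / len(tokens) if tokens else 0
    print(key, f"punctuation: {share:.1f}%", "without stopwords:", content)
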
  • What do you actually want to do? What isn't working? If the `nltk` tools aren't working, what is the error? Your question is a bit vague. First off, I notice that you have a dictionary containing strings, not lists. Is this what you want, or do you want lists? – fam-woodpecker Nov 09 '21 at 22:58
  • Stuart, thank you for the help. I'll learn how to ask proper questions about code; thank you for being clear about that in your response. I've added the intended output above. I am trying to grab just the texts instead of the whole thing, like "['texts']". After failing so many times, I deleted the entire work (dataset and notebook). I'll have to begin from scratch, selecting the data out of a JSON file. I had tried to parse the texts from a pandas DataFrame, and the error message said something about the content not being a proper string. – theflteacher Nov 09 '21 at 23:53
  • If you are going to redo it, I would suggest accounting for the situation in the last item of your example. The apostrophe is not escaped properly (the single backslash `\`). You can convert a string representation of a list using `eval(string)`, but in that case it breaks. It wouldn't break if the strings were raw strings (`r'abc'`) or if there were two backslashes before the apostrophe (`\\'`). The way you have written them out leaves them in a poor format. But the others all work using `[', '.join(eval(val)) for key, val in ds.items()]` – fam-woodpecker Nov 10 '21 at 00:42
  • Hi fam-woodpecker. The original source comes with tons of sentences in this format; all sentences with an apostrophe come this way, which is why I included it in the representation of the data. The closest I got to extracting the raw strings was through a loop, but only on a small piece of the dataset. When I tried to apply it to the entire dataset, the loop broke. – theflteacher Nov 10 '21 at 10:07
  • Can you share some of the raw data? – fam-woodpecker Nov 10 '21 at 10:13
  • fam-woodpecker, here follows a piece of the raw data (see the parsing sketch after these comments): data_raw = {249: '["He doesn\'t like coffee."]', 250: '["My son doesn\'t like to read."]', 251: "['His daughter is three years old.']", 252: '["He doesn\'t like coffee."]', 253: "['Um homem e uma mulher.', 'Um [homem/rapaz/cara] e uma [mulher/moça].']", 254: "['Thanks, good night.', '[Thanks/Thank you/Cheers/Ta], [night/evening].', '[Thanks/Thank you/Cheers/Ta], [and/] [have a/] good [night/evening].']"} – theflteacher Nov 10 '21 at 11:30
  • How do you get that dictionary? What does the text look like at the source? – fam-woodpecker Nov 10 '21 at 22:17
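
The raw data shared in the comments parses without a fallback, because every item containing an apostrophe is wrapped in double quotes; a small check under that assumption (the dict is copied here so the snippet runs on its own):

import ast

data_raw = {249: '["He doesn\'t like coffee."]',
            250: '["My son doesn\'t like to read."]',
            251: "['His daughter is three years old.']",
            252: '["He doesn\'t like coffee."]',
            253: "['Um homem e uma mulher.', 'Um [homem/rapaz/cara] e uma [mulher/moça].']",
            254: "['Thanks, good night.', '[Thanks/Thank you/Cheers/Ta], [night/evening].', '[Thanks/Thank you/Cheers/Ta], [and/] [have a/] good [night/evening].']"}

parsed = {key: ast.literal_eval(value) for key, value in data_raw.items()}
print(parsed[254][0])   # Thanks, good night.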

0 Answers