My scenario: I am trying to get the word count of a text file stored in AWS S3, and to detect its language, using Python code in AWS Lambda. The code below is what I have tried so far. It returns the line count, but I don't know how to get the word count or detect the language. Any ideas on how to get the file's word count and language would be appreciated.

What I tried for the line count:

import boto3

def lambda_handler(event, context):

    # create the S3 resource
    s3 = boto3.resource('s3')

    # get the file object
    obj = s3.Object('bucket name', 'sample.txt')

    # read the file contents into memory and decode the bytes to text
    # (assumes the file is UTF-8 encoded; on Python 3, read() returns bytes)
    file_contents = obj.get()["Body"].read().decode('utf-8')

    # count the newline characters to get the number of lines
    return {
        'Line Count': file_contents.count('\n')
    }

Current Response: { "Line Count": 48 }

Expected Response:

{
    "Line Count": 48,
    "Word Count": ?,    // here I want to show the word count
    "Language": ?       // here I want to show the language name
}

eyllanesc
sai
  • You say it's not working, could you perhaps give more details about what's not working? Could you also provide a sample file and what you expect to get back from that file? – Nick Chapman Jan 09 '19 at 17:03
  • Hi @NickChapman I updated my question could you please check it? – sai Jan 09 '19 at 17:10

1 Answer

To get the number of words, you can try any of the approaches listed here: How to count the number of words in a sentence, ignoring numbers, punctuation and whitespace?
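
As a rough illustration of one such approach, here is a minimal sketch that counts words with a regular expression. It assumes a "word" is a run of letters, optionally joined by an internal apostrophe or hyphen, so numbers and punctuation are ignored; that definition, along with the helper names `count_words` and `summarize`, is my own assumption and not something from the question. It also assumes the file is UTF-8 encoded.

```python
import re

# Assumption: a word is a run of letters, optionally containing
# internal apostrophes or hyphens (e.g. "It's", "well-known").
# [^\W\d_] matches any Unicode letter (word character minus digits/underscore).
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def count_words(text):
    """Count words in a string, ignoring numbers, punctuation and whitespace."""
    return len(WORD_RE.findall(text))


def summarize(raw_bytes, encoding="utf-8"):
    """Build the response dict from the raw bytes read from S3.

    raw_bytes is what obj.get()["Body"].read() returns; the encoding
    is an assumption and may need to change for your files.
    """
    text = raw_bytes.decode(encoding)
    return {
        "Line Count": text.count("\n"),
        "Word Count": count_words(text),
    }
```

Inside the handler, you could then return `summarize(obj.get()["Body"].read())`. If you would rather count every whitespace-separated token (including numbers), `len(text.split())` is a simpler alternative that will give a different total.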

To detect the language, you can try one of the approaches listed here: NLTK and language detection
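
To give a feel for how a simple detector can work, here is a toy sketch of the stopword-overlap idea: score each candidate language by how many of its very common words appear in the text, then pick the highest score. The stopword lists below are tiny, hand-picked and purely illustrative (a real setup would use a full stopword corpus or a dedicated language detection library), and the function name `detect_language` is hypothetical.

```python
import re

# Assumption: deliberately tiny, hand-picked stopword lists for illustration.
# Real stopword corpora are much larger and cover many more languages.
STOPWORDS = {
    "English": {"the", "and", "is", "in", "to", "of", "that", "it"},
    "Spanish": {"el", "la", "y", "es", "en", "de", "que", "los"},
    "French": {"le", "la", "et", "est", "dans", "de", "que", "les"},
    "German": {"der", "die", "und", "ist", "in", "zu", "das", "nicht"},
}


def detect_language(text):
    """Guess the language by counting distinct stopwords found in the text."""
    # Lowercase and split into letter-only tokens (digits/punctuation dropped).
    tokens = set(re.findall(r"[^\W\d_]+", text.lower()))
    # Score = number of that language's stopwords present in the text.
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # No stopword matched at all: report that we cannot tell.
    return best if scores[best] > 0 else "Unknown"
```

This kind of heuristic is unreliable on short texts and on closely related languages, so treat its output as a guess and not as a definitive answer.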

Unfortunately, your question is rather broad. Additionally, detecting a text's language is difficult to get right. Getting the word count is easy, but the result depends heavily on how you define a word.

Nick Chapman