
I used the following code to transcribe a YouTube video into text, but the output comes out a little weird. There are no spaces between some of the words, and some are clubbed together.

#import libraries
from youtube_transcript_api import YouTubeTranscriptApi as yta
import re

#select any youtube video
vid_id = 'S4lTtvlFvyk'

#extract text
data = yta.get_transcript(vid_id)

#make your transcript more better
transcript=''
for value in data:
    for key,val in value.items():
        if key == 'text':
            transcript += val
l=transcript.splitlines()
final_tra = " ".join (l)


#write out transcript in the file
file=open(r"C:\Users\user.name\Desktop\python\DATA\Video files\trans.txt",'w')
file.write(final_tra)
file.close()

And the output file looks like:

check me outthe apple engineers went to the drawingboard to build a bettermask apple actually designed their veryown mask for their employees in store towear they've actually got a coupledifferent versionsbut this is kind of the standard this iswhat most employees will be wearing it'swhat most employeesof apple will have we've got some iphone12 later case news coming at the end ofthis video so stick around for thatwilly doo pulled it off plus someviewers of the lou later show downstairsthat got in touch with him so shout outto them anonymouslythis in front of me is the officialapplemask this is the reusable face mask inmedium largefor more information please visitwelcomeforward.apple.comwhat was crazy to me is on the packagingwhich is all very apple esque as you cantellwe have what looks like a serial numberdefinitely an item number and a lotnumber and production date sojust like everything else appletremendously detailed stuff over hereand an unboxing experience that lookslike it's kind of beyond

Some words are merged with each other, with no space between them. How can I fix this?

James Z

1 Answer


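This may not give you exactly the output format you want, but it's more concise and fixes the word merging. If you print the list of dictionaries returned by get_transcript(), you'll get a better idea of what's going on: each entry's 'text' value is one caption segment with no trailing space, so appending the segments directly glues the last word of one segment to the first word of the next. Joining them with a space avoids that.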

from youtube_transcript_api import YouTubeTranscriptApi as yta

# select any youtube video
vid_id = 'S4lTtvlFvyk'

# collect the text of each caption segment, then join the segments with spaces
transcript = []
for value in yta.get_transcript(vid_id):
    transcript.append(value['text'])

final_tra = ' '.join(transcript)

# write out transcript in the file
with open(r'C:\Users\user.name\Desktop\python\DATA\Video files\trans.txt', 'w') as outfile:
    outfile.write(final_tra)
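
To see the difference without hitting the network, here is a minimal sketch on hand-made data. The `sample` list is hypothetical: it only mimics the shape of what get_transcript() returns, the segment boundaries are guessed from the merged output in the question, and the timings are made up. `join_segments` is an illustrative helper name, not part of the library. It also collapses any newlines inside a segment, which the question's splitlines() step was trying to handle.

```python
# Hypothetical sample shaped like the list of dicts from get_transcript().
# Segment boundaries are assumed from the question's output; timings are invented.
sample = [
    {'text': 'check me out', 'start': 0.0, 'duration': 1.5},
    {'text': 'the apple engineers went to the drawing', 'start': 1.5, 'duration': 2.0},
    {'text': 'board to build a better', 'start': 3.5, 'duration': 2.0},
]


def join_segments(segments):
    """Join caption segments with single spaces, collapsing embedded newlines."""
    # str.split() with no arguments splits on any run of whitespace, including '\n',
    # so each segment is normalized before the segments are joined together.
    return ' '.join(' '.join(seg['text'].split()) for seg in segments)


# What `transcript += val` effectively does: no separator between segments.
merged = ''.join(seg['text'] for seg in sample)

# Joining with a space keeps the words at segment boundaries apart.
fixed = join_segments(sample)

print(merged)
print(fixed)
```

The first line printed shows the merged words ("outthe", "drawingboard"); the second has a space at every segment boundary.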