I have a list of tweets in a document named twfile.txt. A sample txt file may look like this:
RT @CriticalReading: How #Islamophobia works. #Germanwings http://t.co/rX6XVxARiD
Family of Australian victims visit the #Germanwings #GermanWingsCrash crash site in #FrenchAlps #A320Crash #A320 http://t.co/ztReJ1tifU
RT @morningshowon7: #Germanwings: Australian relatives have visited the memorial site in the French alps. #TMS7 http://t.co/BmfiLxHPkC
Three generations from the same family were killed in the #Germanwings Alps crash: http://t.co/6F5MgvBSZG http://t.co/HzJZCZKVZe
Alps crash pilot's hidden illness sparks medical privacy debate #Germanwings. http://t.co/Efe89rxwJG
#Germanwings crash: church in #AndreasLubitz's home town stands by his family http://t.co/QkePs5sG4W http://t.co/irdDnHhxF7
Breaking: #Germanwings co-pilot had been treated 4 suicidal tendencies: http://t.co/6qEynKMSEI/s/KJKu http://t.co/TVdqP4EeWu/s/b4vR @Reuters
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
Audio last 60 seconds from flight deck http://t.co/T4IYK26NrG #Germanwings #GermanWingsCrash #GermanyWings #4U9525 #AndreasLubitz
#Germanwings: Australian relatives have visited the memorial site in the French alps. #TMS7 http://t.co/BmfiLxHPkC
RT @surfinwav: American intelligence contractor among those killed in Alps plane crash http://t.co/m4L0EOd9L2 #Germanwings #GermanWingsCrash
Excellent help & resources from our friends @MindframeMedia over responsible reporting re #Germanwings http://t.co/EQG0kxyQgd #NoStigma
.@Boba71 @Reuters So in Germany any sick psycho can fly a commercial plane hiding behind the so called privacy laws? #germanwings
The World Will Never Forget https://t.co/Th41xouUiS #4U9525 #GermanWings #A320Crash #indeepsorrow #AndreasLubitz
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
I am uncomfortable using word 'depression' for the #Germanwings pilot, depression does not kill other people.
Google Maps has blurred out the home of #Germanwings crash pilot Andreas Lubitz. http://t.co/VTm5sfmT6e
#Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/YpDB8trKFL http://t.co/uML8h6vwD8
#Lufthansa #Germanwings prepare for negligence charges since copilot was known to be suicidal 7 years ago
ICYMI: @swaindiana's interview w. lawyer who represents 4 families, who lost loved ones in #Germanwings crash. http://t.co/dnUXKkCD46 #CBCNN
An airplane crashes, after a couple of HOURS we get who's guilty, with the perfect solution for everybody. I don't buy it. #Germanwings
#Germanwings Crash Settlements Are Likely to Vary by Passenger Nationality - #aviationlaw #montrealconvention http://t.co/MWM8nSEYwG
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
German prosecutors confirm #Germanwings pilot "had continued to see psychiatrists and neurologists until recently" http://t.co/ma1v9zeiIV
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
RT @MindframeMedia: MEDIA: tips when including #mentalillness in stories to avoid perpetuating #stigma http://t.co/W7RlJVe9Lq #Germanwings
#Germanwings plane crash in French Alps: First clues - CNN : http://t.co/AbMPbXFfjG
RT @MindframeMedia: MEDIA: Get to know the facts about #mentalillness & avoiding stigmatising stories http://t.co/ZDd7AFOAir #Germanwings
RT @michaelhallida4: Am I Mad Enough To Crash A Plane Into A Mountain? https://t.co/M9d5nlf4bM #auspol #Germanwings
It's a sick world! How can this happen? RT @Reuters #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/ryw6nTmTNF
RT @Reuters: #Germanwings co-pilot Andreas Lubitz had been treated for suicidal tendencies: http://t.co/p7wqBNvoEW http://t.co/KKAGnvXFDd
I suffer #depression too but I would never risk other people's life. #Germanwings
The following code reads the file line by line. For each URL it finds, it expands the shortened URL and replaces the old URL with the expanded one. It also checks whether the URL points to an image: if it doesn't, the URL is replaced with the web page title; otherwise the expanded URL is kept. The code works fine, except that it is far too slow for a document with thousands of tweets. How can I make it faster?
import codecs
import urllib

from bs4 import BeautifulSoup

output = codecs.open('tw1file.txt', 'w', 'utf-8')
with open('twfile.txt', 'r') as inputf:
    for line in inputf:
        try:
            list1 = line.split(' ')
            for i in range(len(list1)):
                if "http" in list1[i]:
                    # Follow redirects to get the expanded URL
                    response = urllib.urlopen(list1[i])
                    a = response.url
                    if 'photo' in a:
                        # Image link: keep the expanded URL
                        list1[i] = a + ' '
                    else:
                        # Web page: replace the URL with the page title
                        html = response.read()
                        soup = BeautifulSoup(html)
                        t = str(soup.html.head.title)
                        # Strip the surrounding <title> ... </title> markup
                        list1[i] = t[8:-9] + ' '
            line = ' '.join(list1)
            print line
            output.write(line)
        except:
            # Skip tweets whose URL cannot be fetched or parsed
            pass
output.close()
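One direction that might help, as an untested-for-speed sketch rather than a measured fix: almost all of the time goes into waiting on the network, the requests are made one after another, and the sample contains the same short URL several times (the @Reuters retweets, for example). The sketch below collects the distinct URLs first, resolves each one only once, and runs the requests in a thread pool. It assumes Python 3 (the code above is Python 2), and the names `resolve_url`, `rewrite_tweets`, `TITLE_RE` and the `max_workers` value are made up for illustration. To stay dependency-free it pulls the title out with a crude regex instead of BeautifulSoup, and it reads only the first 64 KiB of each page, on the assumption that the title appears near the top.

```python
import http.client
import re
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Crude stand-in for BeautifulSoup's title lookup (assumption: good enough here).
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def resolve_url(url, timeout=10):
    """Expand a short URL; return the final URL for photos, else the page title."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            final_url = response.url
            if "photo" in final_url:  # same image heuristic as the original code
                return final_url
            # Read only the start of the page instead of the whole document.
            html = response.read(65536).decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        return url  # leave the token unchanged if the request fails
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else final_url


def rewrite_tweets(lines, resolve=resolve_url, max_workers=16):
    """Replace every URL token in `lines`, resolving each distinct URL once."""
    tokenized = [line.rstrip("\n").split(" ") for line in lines]
    # Deduplicate so that a URL repeated across retweets is fetched only once.
    urls = sorted({tok for toks in tokenized for tok in toks if "http" in tok})
    # The work is network-bound, so threads can overlap the waiting time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        resolved = dict(zip(urls, pool.map(resolve, urls)))
    return [" ".join(resolved.get(tok, tok) for tok in toks) for toks in tokenized]
```

To use it on the file from the question, pass the open file object (or a list of lines) to `rewrite_tweets` and write the returned lines to tw1file.txt joined with newlines. The `resolve` parameter exists so the BeautifulSoup-based logic from the original loop can be swapped back in; how much faster this is will depend on the number of workers and on how the remote servers respond.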