I have a problem I could use some help with. I have a large txt file that I need to open and read in Python.

After that, I need to remove some names, links, and other things I don't need from the text.

Finally, I should print the text out line by line with a for loop or something similar.

My code so far:

import re

tweet = []

with open("englishtweet.txt", "r") as infile:
    tweet = infile.readlines()

for line in tweet:
    print(line)

If I print the first two lines of the file, I get:

@xirwinshemmo thanks for the follow :)

hii... if u want to make a new friend just add me on facebook! :) xx https:\/\/t.co\/RCYFVrmdDG        

Here I have to remove all names like: @xirwinshemmo

I also need to remove http links like: https://t.co/RCYFVrmdDG

After that, I have to write a for loop that runs through every line in the file so that I can run this code:

for line in tweet:
    if ':)' in line:
        cl.train(line, 'happy')
    elif ':(' in line:
        cl.train(line, 'sad')

I hope someone understands my question and can advise me.

alecxe
Raaydk

1 Answer

Check out my solution. It reads the file one line at a time, so it should work even with very large files that do not fit into RAM. The regexes are kept in a separate list, so you can extend it easily:

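import re

# Patterns to strip from each line; add more regexes here as needed.
parts_to_remove = (
    r'@\w+',          # user mentions such as @xirwinshemmo
    r'https?://\S+',  # http/https links, up to the next whitespace
)

with open('englishtweet.txt', 'r') as infile:
    # Iterating over the file object reads one line at a time.
    for line in infile:

        for part in parts_to_remove:
            # re.sub returns a new string, so assign the result back.
            line = re.sub(part, '', line)

        # cl is the classifier object from the question.
        if ':)' in line:
            cl.train(line, 'happy')
        elif ':(' in line:
            cl.train(line, 'sad')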
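
If you want to check the regexes without the classifier, the cleaning step can be pulled out into small functions. This is only a sketch: `clean_tweet`, `label_tweet` and `labelled_tweets` are names I made up, and the `\\?/` part of the link pattern assumes the slashes in your file may be escaped as `\/`, as they are in the sample line you posted.

```python
import re

# Hypothetical helpers; the patterns are assumptions based on the sample lines.
MENTION = re.compile(r'@\w+')
# Matches http(s) links whether the slashes are plain (//) or escaped (\/\/).
LINK = re.compile(r'https?:\\?/\\?/\S+')


def clean_tweet(line):
    """Remove @mentions and links, then collapse leftover whitespace."""
    line = MENTION.sub('', line)
    line = LINK.sub('', line)
    return ' '.join(line.split())


def label_tweet(line):
    """Return 'happy', 'sad', or None, based on the emoticon in the line."""
    if ':)' in line:
        return 'happy'
    if ':(' in line:
        return 'sad'
    return None


def labelled_tweets(lines):
    """Yield (cleaned_text, label) pairs lazily, skipping unlabelled lines."""
    for line in lines:
        text = clean_tweet(line)
        label = label_tweet(text)
        if label is not None:
            yield text, label
```

With a real file you would pass the open file object, for example `for text, label in labelled_tweets(infile): cl.train(text, label)`, which still reads one line at a time.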