I am new to Python. I have 1000 files in a folder and I want to run a block of code on all of them. The files contain textual content (tweets). I want to remove the "https" links and drop all the columns (e.g. timestamp, article id, etc.) apart from the tweet content column. Any help would be greatly appreciated.

The columns are ARTICLE_ID, HEADLINE, AUTHOR, CONTENT, ARTICLE_URL and MEDIA_PROVIDER. The only column I am interested in is CONTENT.

E.g.

The key to a successful backyard BBQ? infused cupcakes. RT if they look delicious! http://....

I want it to look like

The key to a successful backyard BBQ? infused cupcakes. RT if they look delicious!

Rai
  • Can you post how such a tweet looks, and how you'd like it to look afterwards? – Arne Feb 13 '18 at 13:12
  • 3
    Please reword your question. This is not a code writing service. Show some effort and try to ask specific question. Most of what you are asking has probably been answered many times (how to loop through a directory? How to replace substrings?...) – FlyingTeller Feb 13 '18 at 13:13
  • Do you need help with the parsing, or with running it on all files in a directory? – shayelk Feb 13 '18 at 13:13
  • @FlyingTeller Well, if they don't know how their problem is called, they can hardly search for it, right? I'm all for linking to duplicates, but that can be done in a helpful, constructive and unabrasive way. – Arne Feb 13 '18 at 13:14
  • The key to a successful backyard BBQ? infused cupcakes. RT if they look delicious! http:// I want it to look like The key to a successful backyard BBQ? infused cupcakes. RT if they look delicious – Rai Feb 13 '18 at 13:17
  • @Arne thanks so much for your constructive words. i am just a beginner in Python and did not know how to approach the problem. – Rai Feb 13 '18 at 13:19
  • @Rai No problem =) Include the full content of an exemplary file containing a tweet in your question, and try to format it in a way that it can be easily copy/pasted. – Arne Feb 13 '18 at 13:23
  • with all due respect Rai, @FlyingTeller's words were more constructive, from a pragmatic point of view. His point is: break your project into smaller parts. If you lack algorithmic thinking, then perhaps programming is not for you. – Adelin Feb 13 '18 at 13:23
  • 1
    @Rai you can use https://stackoverflow.com/questions/10377998/how-can-i-iterate-over-files-in-a-given-directory as a starting point on how to iterate over files. Then you should try opening them and reading the content (test it on a folder with only one file for a start), then you can go about trying to modifying the content – FlyingTeller Feb 13 '18 at 13:25
  • @Adelin Everyone starts at zero. How are you supposed to learn programming if people tell you off once you start asking questions. – Arne Feb 13 '18 at 13:26
  • @FlyingTeller Thanks a lot for your inputs. – Rai Feb 13 '18 at 13:33
  • @Adelin I dont know whether I lack or not. Let me spend some time in this domain and then would like to comment on that. Also, without even trying something I cannot comment on my strengths or weaknesses. Its not judicious to discourage at a time when someone is just beginning to learn. – Rai Feb 13 '18 at 13:37
  • @Rai You talked about columns in your file. Is there a tab between them? I am guessing your file is something like \t\t\t? – FlyingTeller Feb 13 '18 at 13:37

1 Answer

As far as I can tell from your question, you want to 1) read the content of all files in a directory, 2) change a local copy of that content, and 3) write that result somewhere else:

1) As @FlyingTeller pointed out, many good answers to that problem exist already. But in short:

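import os

tweet_dir = 'some/location/on/your/pc'
for file_name in os.listdir(tweet_dir):
    file_path = os.path.join(tweet_dir, file_name)
    if not os.path.isfile(file_path):
        continue  # skip sub-directories such as an output folder
    with open(file_path) as tweet_file:
        # read() returns the whole file as one string
        tweet = tweet_file.read()
    # now we can modify the content we copied into 'tweet'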

2) If you want to know how to modify strings in Python, take a look at the documentation for the str methods and maybe also the re (regular expression) module. Inside the loop, everything from the first link onward can be deleted like this (this only works if the link is the last thing in the message, as in your example):

# cut at the first link, whether it starts with http:// or https://
tweet = tweet.split('http://')[0].split('https://')[0].rstrip()
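
If links can also appear in the middle of a message, a regular expression is a more forgiving option than cutting the string at the first link. The following is only a sketch: the names URL_PATTERN and strip_urls are made up for this example, and the pattern simply treats everything from http:// or https:// up to the next whitespace as a link.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r'https?://\S*')


def strip_urls(text):
    """Remove http(s) links and collapse the whitespace they leave behind."""
    without_links = URL_PATTERN.sub('', text)
    # split() + join() turns any run of whitespace into a single space
    # and trims both ends.
    return ' '.join(without_links.split())


example = 'RT if they look delicious! http://....'
print(strip_urls(example))
```

Because the whitespace is normalised at the end, line breaks inside a message are turned into single spaces as well.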

3) As with the other points, a good answer for 'how to write to a file in Python' exists already. But in short, once you have modified the tweet the way you want, you can do this inside the loop:

# create a directory called 'changed' within the original one by hand, and then:
with open(os.path.join(tweet_dir, 'changed', file_name), 'w') as new_tweet_file:
    new_tweet_file.write(tweet)

done.
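
Putting the three steps together, and also keeping only the CONTENT column, could look roughly like the sketch below. It rests on assumptions the question does not confirm: each file has a header row with the column names, the columns are separated by tabs, and the files are UTF-8 encoded. The names clean_content and clean_folder are hypothetical helpers, not part of any library.

```python
import csv
import os
import re

URL_PATTERN = re.compile(r'https?://\S*')


def clean_content(text):
    """Drop http(s) links and collapse leftover runs of whitespace."""
    return ' '.join(URL_PATTERN.sub('', text).split())


def clean_folder(src_dir, dst_dir, column='CONTENT', delimiter='\t'):
    """Write the cleaned `column` of every file in src_dir to dst_dir.

    Assumes a header row and `delimiter` between columns; both are
    guesses and need to be adjusted to the real files.
    """
    os.makedirs(dst_dir, exist_ok=True)
    for file_name in os.listdir(src_dir):
        src_path = os.path.join(src_dir, file_name)
        if not os.path.isfile(src_path):
            continue  # skip sub-folders, e.g. dst_dir itself
        dst_path = os.path.join(dst_dir, file_name)
        with open(src_path, newline='', encoding='utf-8') as src, \
                open(dst_path, 'w', encoding='utf-8') as dst:
            for row in csv.DictReader(src, delimiter=delimiter):
                # A missing or empty cell is written as an empty line.
                dst.write(clean_content(row.get(column) or '') + '\n')
```

A call such as clean_folder('some/location/on/your/pc', 'some/location/on/your/pc/changed') would then write one cleaned tweet per line into the 'changed' folder. The output folder has to differ from the input folder, otherwise each file would be overwritten while it is still being read.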

If you can split your general problem into nice bite-sized obstacles, it becomes much easier to find a solution on StackOverflow or, even better, to figure one out yourself =)

Arne
  • 1
    Thanks a lot for your suggestions. You have really made the complex program look easy and doable. I would definitely keep in mind this approach to tackle future problems. – Rai Feb 13 '18 at 13:48