
I'm trying to clean some Twitter data stored in a CSV file using Python in a Jupyter notebook, so I tried this code:

import pandas as pd

unwanted_characters = [',', '@', '\n', '&', '_']

with open('facebook_Tweet.csv', 'r') as f:
    with open('cleaned_facebook_Tweet.csv', 'w') as ff:
        for unwanted in unwanted_characters:
            ff.write(f.read().replace(unwanted, ''))

tweety = pd.read_csv("cleaned_facebook_Tweet.csv", error_bad_lines=False)
tweety.head()

When I run this code, I get this result:

tweet1:Time:Sun Dec 06 09:59:02 +0000 2020 tweet text:RT @_Aaron_Anthony_: Seen this of Facebook and it hit home.\n\nRemember this Christmas if someone pays \u00a320 for a gift for you and they get\u2026
tweet2:Time:Sun Dec 06 09:59:02 +0000 2020 tweet text:RT @TopAchat: Concours \ud83c\udf81 #PetitPapaTopAchat \ud83c\udf84\n\n\ud83d\udd25 + de 60 000 \u20ac de cadeaux \u00e0 gagner !\n\nCa continue avec le #Lot7 de 4333 \u20ac ! \ud83d\udd25\n\nPour partici\u2026

As you can see, the unwanted characters are still there. My code only removes the first unwanted character (in my example, the ',') and keeps the others (for example, the '@' and the '\n').

How can I fix my code? Thanks a lot.

Sekmani52

1 Answer


Hey there!

You "cannot" read an opened file twice. At least that would not work as you may expect. When you call f.read() it returns the content of the file from beginning to end and leaves the reading cursor at the end of the file. So, when you call f.read() again, it returns nothing.

Also, even if reading worked the way you expect, calling write() once per loop iteration would append one full copy of the file per replacement, each copy with only a single character removed, so the end result would still not be what you want. In your run, only the first iteration (the ',' one) had anything to write; every later f.read() returned an empty string, which is exactly why only the comma was removed.

My advice here: use an intermediate variable, something like this:

import pandas as pd

unwanted_characters = [',', '@', '\n', '&', '_']

# Read the whole file once; all replacements then happen in memory.
with open('facebook_Tweet.csv', 'r') as f:
    output_string = f.read()

for unwanted in unwanted_characters:
    output_string = output_string.replace(unwanted, '')

# Write the cleaned text with a single call.
with open('cleaned_facebook_Tweet.csv', 'w') as ff:
    ff.write(output_string)
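As a side note, the standard library can also drop all of the characters in a single pass. A minimal sketch using str.translate, which is equivalent to the loop above and usually faster for single-character removals:

removal_table = str.maketrans('', '', ',@\n&_')
output_string = output_string.translate(removal_table)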
                    

The final code does not matter too much, but I think it is important that you understand the concepts I have explained. I recommend reading this doc, and maybe this one as well.

Also, this Stack Overflow question may help you understand what I said.

spotHound
  • Hi, thanks for the speedy answer. The code works now, but I think I chose a bad way to clean the data: the process takes a lot of time and blocks the page each time. Is there a better way? – Sekmani52 Dec 12 '20 at 23:06
  • 1
    Sure. If you have a huge amount of data this loop would be too slow. Pandas is a framework that helps you t manage data, just use it! Pandas use NumPy to work with collections and parallelism. I recommend trying a parallelized alternative. You may find interesting these others stack overflow related questions: https://stackoverflow.com/questions/13682044/remove-unwanted-parts-from-strings-in-a-column and https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame I recommend you looking for solutions that use pandas/numpy methods to take advantage of parallelism. – spotHound Dec 12 '20 at 23:13
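For example, here is a minimal sketch of the vectorized approach the linked questions describe (the column name 'tweet_text' is an assumption; adjust it to whatever your CSV actually uses):

import pandas as pd

tweety = pd.read_csv('facebook_Tweet.csv')
# Remove every unwanted character in one vectorized pass via a regex character class.
tweety['tweet_text'] = tweety['tweet_text'].str.replace(r'[,@\n&_]', '', regex=True)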