
I'm parsing tweet data that is in JSON format and compressed with gzip.

Here's my code:

###Preprocessing
##Importing:
import os
import gzip
import json
import pandas as pd
from pandas.io.json import json_normalize

##Variables:
# tweets: empty DataFrame used for merging
tweets = pd.DataFrame()
idx = 0

# The parser reads the input data and returns it as a pd.DataFrame

###Directory reading:
##Reading the whole directory from the path below
for root, dirs, files in os.walk('D:/twitter/salathe-us-twitter/11April1'):
    for file in files:
        # file tracking and memory check:
        print(file, tweets.memory_usage())
        # ext represents the file extension.
        ext = os.path.splitext(file)[-1]
        if ext == '.gz':
            with gzip.open(os.path.join(root, file), "rt") as tweet_file:
                # print(tweet_file)
                for line in tweet_file:
                    try:
                        temp = line.partition('|')
                        date = temp[0]
                        tweet = json.loads(temp[2])
                        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
                            # Mapping for memory.
                            # The index must be a sequence, like a Series.
                            # Temporarily solved by wrapping the int values (id, retweet_count) in lists.
                            #print(tweet)
                            temp_dict = {"id": tweet["user"]["id"],
                                         "text": tweet["text"],
                                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                                         "date":[int(date[:8])]}
                            #idx for DataFrame ix
                            temp_DF = pd.DataFrame(temp_dict, index=[idx])
                            tweets = pd.concat([tweets, temp_DF])
                            idx += 1
                    except:
                        continue
        else:
            with open(os.path.join(root, file), "r") as tweet_file:
                # print(tweets_file)
                for line in tweet_file:
                    try:
                        temp = line.partition('|')
                        #date
                        date = temp[0]
                        tweet = json.loads(temp[2])
                        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
                            # Mapping for memory.
                            # The index must be a sequence, like a Series.
                            # Temporarily solved by wrapping the int values (id, retweet_count) in lists.
                            #print(tweet)
                            temp_dict = {"id": [tweet["user"]["id"]],
                                         "text": tweet["text"],
                                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                                         "date":[int(date[:8])]}
                            temp_DF = pd.DataFrame(temp_dict, index=[idx])
                            tweets = pd.concat([tweets, temp_DF])
                            idx += 1
                    except:
                        continue

##STORING PROCESS.
store = pd.HDFStore('D:/Twitter_project/mydata.h5')
store['11April1'] = tweets
store.close()

My code can be split into 3 parts: reading, processing to select columns, and storing. What I'm interested in is parsing faster. So here are my questions: it's too slow. How could it be made much faster? Should I read with the pandas JSON reader? I guess that's much faster than plain json.loads... But! Because my raw tweet data has multi-index values, pandas read_json didn't work. And overall, I'm not sure I implemented my code well. Are there any problems, or a better way? I'm kind of new to programming, so please teach me to do better.

P.S. The computer just turned off while the code was running. Why does this happen? A memory problem?

Thanks for reading.

P.P.S. Here is one example line of the input data:

20110331010003954|{"text":"#Honestly my toe still aint healed im suppose to be in that boot still!!!","truncated":false,"in_reply_to_user_id":null,"in_reply_to_status_id":null,"favorited":false,"source":"web","in_reply_to_screen_name":null,"in_reply_to_status_id_str":null,"id_str":"53320627431550976","entities":{"hashtags":[{"text":"Honestly","indices":[0,9]}],"user_mentions":[],"urls":[]},"contributors":null,"retweeted":false,"in_reply_to_user_id_str":null,"place":{"country_code":"US","country":"United States","bounding_box":{"type":"Polygon","coordinates":[[[-84.161625,35.849573],[-83.688543,35.849573],[-83.688543,36.067417],[-84.161625,36.067417]]]},"attributes":{},"full_name":"Knoxville, TN","name":"Knoxville","id":"6565298bcadb82a1","place_type":"city","url":"http:\/\/api.twitter.com\/1\/geo\/id\/6565298bcadb82a1.json"},"retweet_count":0,"created_at":"Thu Mar 31 05:00:02 +0000 2011","user":{"notifications":null,"profile_use_background_image":true,"default_profile":true,"profile_background_color":"C0DEED","followers_count":161,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1220577968\/RoadRunner_normal.jpg","is_translator":false,"profile_background_image_url":"http:\/\/a3.twimg.com\/a\/1301071706\/images\/themes\/theme1\/bg.png","default_profile_image":false,"description":"Cool & Calm Basically Females are the way of life and key to my heart...","screen_name":"FranklinOwens","verified":false,"time_zone":"Central Time (US & Canada)","friends_count":183,"profile_text_color":"333333","profile_sidebar_fill_color":"DDEEF6","location":"","id_str":"63499713","show_all_inline_media":true,"follow_request_sent":null,"geo_enabled":true,"profile_background_tile":false,"contributors_enabled":false,"lang":"en","protected":false,"favourites_count":8,"created_at":"Thu Aug 06 18:24:50 +0000 2009","profile_link_color":"0084B4","name":"Franklin","statuses_count":5297,"profile_sidebar_border_color":"C0DEED","url":null,"id":63499713,"listed_count":0,"following":null,"utc_offset":-21600},"id":53320627431550976,"coordinates":null,"geo":null}

It's just one line. I have more than 200GB of this, compressed as gzip files. I guess the number at the very front refers to the date. I'm not sure whether that's clear to you.

Yoo Inhyeok
    can you give an example of the input file? – taras Apr 16 '17 at 06:28
  • Um.. I linked it on the words 'multi index values'. It's almost the same, except there's a date at the very front of the data. And I'm not sure I can post it; because it's real data, it could cause some legal problems. – Yoo Inhyeok Apr 17 '17 at 11:01
  • I'm sorry, English is not my language, so conversation is hard. If it's hard to read, tell me and I'll fix it to make it clearer. – Yoo Inhyeok Apr 17 '17 at 11:02
  • you do not have to post the real data, make some dummy data (examples) in the format of the real data. Replace everything important with quotes from your favourite poem :) – taras Apr 17 '17 at 15:41
  • @Taras Ok, I posted it. – Yoo Inhyeok Apr 18 '17 at 04:08
  • There is room for improvement and I am not sure I will be able to post it today. Try to remove `pd.concat`: each call makes a full copy of the data, which creates a significant performance hit. Gather all the data into an array and then create a DataFrame from it. – taras Apr 18 '17 at 18:31

1 Answer


First of all, my congratulations. You get better as a software engineer when you face real world challenges like this one.

Now, talking about your solution: every piece of software works in 3 phases.

  1. Input data.
  2. Process data.
  3. Output data. (response)

Input data

1.1. boring stuff

The information should preferably be in one format. To achieve that we write parsers, APIs, wrappers and adapters. The idea behind all of them is to transform data into the same format. This helps to avoid issues when working with different data sources: if one of them breaks, you fix only that one adapter and that's it, all the others and your parser still work.

1.2. your case

You have data coming in with the same schema but in different file formats. You can either convert everything to one format up front (read it all as json/txt), or extract the code that transforms the data into a separate function or module and reuse/call it in both branches. Example:

def process_data(tweet_file):
    for line in tweet_file:
        # do your stuff
        ...


with gzip.open(os.path.join(root, file), "rt") as tweet_file:
    process_data(tweet_file)

with open(os.path.join(root, file), "r") as tweet_file:
    process_data(tweet_file)
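
If you want to go one step further, the choice between gzip.open and open can itself be hidden behind a tiny helper, so the directory-walking loop no longer cares about the extension. A minimal sketch, reusing the process_data above (the open_tweet_file name is just an illustration, not an existing function):

import gzip
import os


def open_tweet_file(path):
    # return a text-mode file handle, transparently unpacking .gz files
    if path.endswith('.gz'):
        return gzip.open(path, 'rt')
    return open(path, 'r')


for root, dirs, files in os.walk('D:/twitter/salathe-us-twitter/11April1'):
    for file in files:
        with open_tweet_file(os.path.join(root, file)) as tweet_file:
            process_data(tweet_file)  # the same function as above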

2. Process data

2.1 boring stuff

Most likely this is the bottleneck. Here your goal is to transform data from the given format into the desired format and perform some actions if required. This is where you get all the exceptions, all the performance issues, all the business logic. This is where software-engineering craft comes in handy: you create an architecture and you decide how many bugs to put in it.

2.2 your case

The simplest way to deal with an issue is to know how to find it. If it is performance, put in timestamps to track it down. With experience it gets easier to spot the issues. In this case, pd.concat most likely causes the performance hit. With each call it copies all the data to a new instance, so you have 2 objects in memory when you need only 1. Try to avoid concat: gather all the data into a list and then put it into the DataFrame in one go.
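
For example, a minimal sketch of that list-gathering idea, using the same fields as your temp_dict (and assuming tweet_file is the file handle opened in the reading step):

import json
import pandas as pd

# tweet_file is the already-opened file handle from the reading step
rows = []  # plain Python list; appending to it is cheap

for line in tweet_file:
    date, _, data = line.partition('|')
    try:
        tweet = json.loads(data)
        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
            rows.append({"id": tweet["user"]["id"],
                         "text": tweet["text"],
                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                         "date": int(date[:8])})
    except (ValueError, KeyError, TypeError, IndexError):
        continue

# build the DataFrame once at the end, instead of calling pd.concat per tweet
tweets = pd.DataFrame(rows)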

For instance, I would not put all the data into a DataFrame right at the start; you can gather it and write it to a csv file and then build a DataFrame from that, since pandas deals with csv files really well. Here is an example:

import json
import csv

source_file = '11April1.txt'
result_file = 'output.csv'


with open(source_file) as source:
    with open(result_file, 'wb') as result:
        writer = csv.DictWriter(result, fieldnames=['id', 'text', 'hashtags', 'date', 'idx'])
        writer.writeheader()

        # get the index together with the line
        for index, line in enumerate(source):
            # a handy way to get both parts in 1 call; maxsplit=1 keeps any '|' inside the tweet JSON intact
            date, data = line.split('|', 1)
            tweet = json.loads(data)
            if tweet['user']['lang'] != 'en' or tweet['place']['country_code'] != 'US':
                continue

            item = {"id": tweet["user"]["id"],
                    "text": tweet["text"],
                    "hashtags": tweet["entities"]["hashtags"][0]["text"],
                    "date": int(date[:8]),
                    "idx": index}

            # either write it to the csv or save it into an array
            # tweets.append(item)
            writer.writerow(item)

print "done"

3. Output data.

3.1. boring stuff

After your data is processed and in the right format, you want to see the results, right? This is where HTTP responses and page loads happen, where pandas builds graphs, etc. You decide what kind of output you need; that's why you created the software in the first place: to get what you want out of a format you did not want to wade through by yourself.

3.2 your case

You have to find an efficient way to get the desired output from the processed files. Maybe you need to put the data into HDF5 format and process it on Hadoop; in that case your software's output becomes someone else's software input, sexy right? :D Jokes aside, gather all the processed data from the csv files or arrays and put it into HDF5 in chunks. This is important because you cannot load everything into RAM. RAM is called temporary memory for a reason: it is fast and very limited, so use it wisely. In my opinion, this is also why your PC turned off. Or there may be memory corruption due to the nature of some C libraries, which happens from time to time.
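
A rough sketch of that chunked approach, assuming the output.csv written earlier; the chunk size and min_itemsize values here are illustrative, min_itemsize just reserves room for the string columns so that later chunks still fit:

import pandas as pd

store = pd.HDFStore('D:/Twitter_project/mydata.h5')

# stream the csv in pieces so the full 200GB never has to sit in RAM at once
for chunk in pd.read_csv('output.csv', chunksize=100000):
    store.append('11April1', chunk, min_itemsize={'text': 300, 'hashtags': 140})

store.close()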

Overall, try to experiment, and get back to StackOverflow if anything comes up.

taras
  • Thank you so much. I really appreciate your help. And I have a question: in section 1.2 you said 'call a function or module'. I would guess that's much slower than my way, but it isn't? – Yoo Inhyeok Apr 21 '17 at 04:07
  • And second, you mean gathering all the tweets and putting them into csv format, then putting them into a dataframe all at once, right? So the process is separated into two parts: save them to a csv file, then load the csv file to save it in dataframe format, right? And it's also much faster? – Yoo Inhyeok Apr 21 '17 at 04:19
  • Oh wait. I need the data for text mining, with the nltk python package. Then I don't think I need the DataFrame package or the HDF5 format, right? – Yoo Inhyeok Apr 21 '17 at 04:25
  • 1) If you extract functionality into a function, it does not make it faster or slower; it just allows you to reuse the same functionality and avoid bugs produced by duplication. – taras Apr 21 '17 at 06:34
  • 2) Right, separate it into 2 parts. My guess is that it should be faster to gather everything into a list or a csv file and then process everything at once. Processing each line from start to end is not a good idea. Once you get rid of `pd.concat` it will be faster. – taras Apr 21 '17 at 06:35
  • 3) You might not need a DataFrame and could use the CSV file directly with nltk. The idea behind the CSV file generation is to `filter/clean the input data for further processing`. After that you can move it to HDF5, a DataFrame, or a database. – taras Apr 21 '17 at 06:35
  • Okay, I understand. Thank you! Oh, one more thing: why did you use 'wb' to open the csv file? – Yoo Inhyeok Apr 21 '17 at 13:35
  • You are welcome :) Please accept the answer if it solves the problem. It might not, you decide. Here is an answer about file modes: http://stackoverflow.com/questions/1466000/python-open-built-in-function-difference-between-modes-a-a-w-w-and-r – taras Apr 21 '17 at 14:40