2

I have the giant dataset as the following. Most of the columns are in "Json format" but in pandas object datatype. I would like to convert them all into dicts and be able to fetch the values of each key.

    id                              body                                 created_at            user                                                                     source          symbols entities    conversation    likes
0   10000001      sina are the stocks i hate to trade...  2012-10-15 18:54:55+00:00   {'id': 117974, 'username': 'choptrade', 'name'...   {'id': 1, 'title': 'Stocktwits', 'url': 'https...   [{'id': 2839, 'symbol': 'NFLX', 'title': 'NetF...   {'sentiment': None} None    None
1   10000002    An Introduction to Financial Spread Betting ht...   2012-10-15 18:55:00+00:00   {'id': 179410, 'username': 'KennysElliottWaves...   {'id': 2, 'title': 'Twitter', 'url': 'http://t...   [{'id': 667, 'symbol': 'EURUSD', 'title': 'Eur...   {'sentiment': None} None    None
2   10000003    Published $TXT estimate on Estimize: +0.05 EPS...   2012-10-15 18:55:00+00:00   {'id': 12134, 'username': '1nvestor', 'name': ...   {'id': 2, 'title': 'Twitter', 'url': 'http://t...   [{'id': 7506, 'symbol': 'TXT', 'title': 'Textr...   {'sentiment': None} None    None
3   10000005    $VRNG holding lot of shares under 3. Hold or s...   2012-10-15 18:55:06+00:00   {'id': 146417, 'username': 'CryptoMafia', 'nam...   {'id': 1, 'title': 'Stocktwits', 'url': 'https...   [{'id': 9214, 'symbol': 'XSPA', 'title': 'Xpre...   {'sentiment': None} {'parent_message_id': 10000005, 'in_reply_to_m...   None
4   10000006    We have a huge tail on oil. If we close near c...   2012-10-15 18:55:07+00:00   {'id': 86846, 'username': 'johnkicklighter', '...   {'id': 1, 'title': 'Stocktwits', 'url': 'https...   [{'id': 643, 'symbol': 'CL_F', 'title': 'Light...   {'chart': {'thumb': 'https://charts.stocktwits...   {'parent_message_id': 10000006, 'in_reply_to_m...   None
... ... ... ... ... ... ... ... ... ...
173592731   459260899   $BBIG I really thought we would get a PR this ...   2022-05-13 12:39:08+00:00   {'id': 3231219, 'username': 'DMaCATO', 'name':...   {'id': 2269, 'title': 'StockTwits Web', 'url':...   [{'id': 16144, 'symbol': 'BBIG', 'title': 'Vin...   {'sentiment': None} None    None
173592732   459260904   $LUNA Absolutely iconic in the story of why 40...   2022-05-13 12:39:09+00:00   {'id': 5647436, 'username': 'Julieblue', 'name...   {'id': 2269, 'title': 'StockTwits Web', 'url':...   [{'id': 2563, 'symbol': 'LUNA', 'title': 'Luna...   {'sentiment': {'basic': 'Bearish'}} None    None
173592733   459260906   $TWTR get these people on the ropes and drop b...   2022-05-13 12:39:09+00:00   {'id': 4253849, 'username': 'Trump2036', 'name...   {'id': 2269, 'title': 'StockTwits Web', 'url':...   [{'id': 7872, 'symbol': 'TWTR', 'title': 'TWIT...   {'sentiment': None} None    None
173592734   459260907   $SPY I want the shorts to cry today 2022-05-13 12:39:09+00:00   {'id': 1129723, 'username': 'labibaouf', 'name...   {'id': 2269, 'title': 'StockTwits Web', 'url':...   [{'id': 7271, 'symbol': 'SPY', 'title': 'SPDR ...   {'sentiment': {'basic': 'Bullish'}} None    None
173592735   459260909   TSLA but bad for democracy and... 2022-05-13 12:39:09+00:00   {'id': 727683, 'username': 'Iwillplay', 'name'...   {'id': 2095, 'title': 'StockTwits For Android ...   [{'id': 7872, 'symbol': 'TWTR', 'title': 'TWIT...   {'sentiment': None} None    None

my current apporach is to use the pandas apply:

import ast
def dict_convert(x):
    try:
        return ast.literal_eval(x)
    except:
        return np.nan
df["col"] = df.col.apply(lambda x: dict_convert(x))

However this would take so much time and it is not computational efficient. I have 64 cores so I would like to have some vertorrized method. I also tried the Pandarallel module it ended up being taking significantly more time than the base pandas apply.

pd.json_normalize does not seem to work either. I think it is due to the instances being in object data type.

pd.json_normalize(df.col.iloc[:100].values)

out:
0
1
2
3
4
...
95
96
97
98
99
100 rows × 0 columns
Kevin Li
  • 43
  • 8
  • Try pyspark or cuDF – ahmedshahriar May 22 '22 at 04:29
  • By using the same apply function? – Kevin Li May 22 '22 at 04:30
  • since u have 64 cores so when you used pandarallel it was using 64 workers, right? – ahmedshahriar May 22 '22 at 04:39
  • Yes it was using 64 workers. And the CPU times were much less compared to the base pandas apply. But the wall time somehow increased by a significant margin almost 3X – Kevin Li May 22 '22 at 04:40
  • I had a similar problem whereI used pd.apply +pandarallel on two large csv. It took days. Since it's a big dataset, you can try pyspark. Check [pandas.io.json.json_normalize in pyspark dataframe.](https://stackoverflow.com/questions/59870193/is-there-a-function-in-pyspark-dataframe-that-is-similar-to-pandas-io-json-json) – ahmedshahriar May 22 '22 at 04:47
  • 1
    Thanks for the suggestions admed but I have two problems: 1. The instances are in object format instead of dict therefore I dont think json normalize will work at the first place. 2. My cluster node does not have access to internet. Do you have any other suggestion? – Kevin Li May 22 '22 at 05:07
  • 2
    Check this answer, this might be helpful for you [parse a string column that looks like JSON/dic](https://stackoverflow.com/a/70239454/11105356) – ahmedshahriar May 22 '22 at 06:24

0 Answers0