1

My tweets Database, is MongoDB is having the following schema, I want to read this with pandas Dataframe in separate columns. I also want inside components of hashtags- text and indices.

{
        "_id" : ObjectId("5a11200441f0c41f447ce56c"),
        "created" : ISODate("2017-11-19T06:09:06Z"),
        "text" : "#Bitcoin Hong Kong's bitcoin businesses suffer after local bank accounts frozen ,
        "username" : "PennyStocksMomo",
        "hashtags" : [
                {
                        "text" : "Bitcoin",
                        "indices" : [
                                0,
                                8
                        ]
                }
        ],
        "language" : "en",
        "id" : "932128582767296512",
        "followers" : 5715
}

"EDIT"

I used the code below.

import pymongo

import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.BitCoinDatabase
collection = db.tweets
data = pd.DataFrame(list(collection.find()))



_id created followers   hashtags    id  language    text    username
0   5a11200441f0c41f447ce56c    2017-11-19 06:09:06 5715    [{'text': 'Bitcoin', 'indices': [0, 8]}]    932128582767296512  en  #Bitcoin Hong Kong's bitcoin businesses suffer...   PennyStocksMomo
1   5a11200441f0c41f447ce56d    2017-11-19 06:09:06 19526   [{'text': 'Bitcoin', 'indices': [0, 8]}]    932128583077675008  en  #Bitcoin Hong Kong's bitcoin businesses suffer...   CryptoTraderPro
Amar Kumar
  • 43
  • 1
  • 9

1 Answers1

1

Create DataFrame by column hashtags and join to original:

Notice - Solution works if only one dict per row in list in hashtags column.

df = df.join(pd.DataFrame(df['hashtags'].str[0].values.tolist()).add_suffix('_hash'))
print (df)
                        _id          created followers  \
0  5a11200441f0c41f447ce56c   2017-11-19 06:09:06 5715   
1  5a11200441f0c41f447ce56d  2017-11-19 06:09:06 19526   

                                   hashtags                  id language  \
0  [{'text': 'Bitcoin', 'indices': [0, 8]}]  932128582767296512       en   
1  [{'text': 'Bitcoin', 'indices': [0, 8]}]  932128583077675008       en   

                                                text         username  \
0  #Bitcoin Hong Kong's bitcoin businesses suffer...  PennyStocksMomo   
1  #Bitcoin Hong Kong's bitcoin businesses suffer...  CryptoTraderPro   

  indices_hash text_hash  
0       [0, 8]   Bitcoin  
1       [0, 8]   Bitcoin  

EDIT:

AttributeError: 'float' object has no attribute 'keys'

means there are some NaNs values.

I try simulate it:

print (df)
                        _id          created followers  \
0  5a11200441f0c41f447ce56c   2017-11-19 06:09:06 5715   
1  5a11200441f0c41f447ce56d  2017-11-19 06:09:06 19526   
2  5a11200441f0c41f447ce56c   2017-11-19 06:09:06 5715   

                                   hashtags                  id language  \
0  [{'text': 'Bitcoin', 'indices': [0, 8]}]  932128582767296512       en   
1                                       NaN  932128583077675008       en   
2  [{'text': 'Bitcoin', 'indices': [0, 8]}]  932128582767296512       en   

                                                text         username  
0  #Bitcoin Hong Kong's bitcoin businesses suffer...  PennyStocksMomo  
1  #Bitcoin Hong Kong's bitcoin businesses suffer...  CryptoTraderPro  
2  #Bitcoin Hong Kong's bitcoin businesses suffer...  PennyStocksMomo  

Solution is first remove NaNs and add index parameter to DataFrame constructor for align data:

hashtags = df['hashtags'].dropna()
df = df.join(pd.DataFrame(hashtags.str[0].values.tolist(),
             index=hashtags.index).add_suffix('_hash'))
print (df)
                        _id          created followers  \
0  5a11200441f0c41f447ce56c   2017-11-19 06:09:06 5715   
1  5a11200441f0c41f447ce56d  2017-11-19 06:09:06 19526   
2  5a11200441f0c41f447ce56c   2017-11-19 06:09:06 5715   

                                   hashtags                  id language  \
0  [{'text': 'Bitcoin', 'indices': [0, 8]}]  932128582767296512       en   
1                                       NaN  932128583077675008       en   
2  [{'text': 'Bitcoin', 'indices': [0, 8]}]  932128582767296512       en   

                                                text         username  \
0  #Bitcoin Hong Kong's bitcoin businesses suffer...  PennyStocksMomo   
1  #Bitcoin Hong Kong's bitcoin businesses suffer...  CryptoTraderPro   
2  #Bitcoin Hong Kong's bitcoin businesses suffer...  PennyStocksMomo   

  indices_hash text_hash  
0       [0, 8]   Bitcoin  
1          NaN       NaN  
2       [0, 8]   Bitcoin 
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252