
I am learning how to parse data and am trying to create templates that I can reuse later by changing only the parameters of the loops, functions, and methods.

I scraped the Twitter API for hashtag-related tweets and got back a list of nested dictionaries. I then saved the scraped data to a txt file and have been trying to clean the text and convert it to a table of rows. My problem when trying to create a table is locating the headers: the first line of the txt file has all the headers I need, but there is a value next to each header, and some of those values are dictionaries with key-value pairs inside. Most tutorials use sample files where the first line is a clean row of header titles with nothing in between. This is more complex, and I thought that if I learned how to do it, I would be happy to move on.

Here is the data; sorry if it's messy. I cleaned it in Notepad by starting each new line at `domain` (I did not know how to do this in Python; it would be a plus to know). The file starts with a square bracket, indicating a list. Each item in the list has 2 key-value pairs, and the values of both pairs are dictionaries with 2-3 key-value pairs inside.

All I need to do is turn the keys into headers for the first line, because the keys are the same for all lines in the txt file, and then create a table from the headers and values.

[{'domain': {'id': '46', 'name': 'Business Taxonomy', 'description': 'Categories within Brand Verticals that narrow down the scope of Brands'}, 'entity': {'id': '1557696848252391426', 'name': 'Financial Services Business', 'description': 'Brands, companies, advertisers and every non-person handle with the profit intent related to Banks, Credit cards, Insurance, Investments, Stocks '}}, 
{'domain': {'id': '46', 'name': 'Business Taxonomy', 'description': 'Categories within Brand Verticals that narrow down the scope of Brands'}, 'entity': {'id': '1557697333571112960', 'name': 'Technology Business', 'description': 'Brands, companies, advertisers and every non-person handle with the profit intent related to softwares, apps, communication equipments, hardwares'}}, 
{'domain': {'id': '30', 'name': 'Entities [Entity Service]', 'description': 'Entity Service top level domain, every item that is in Entity Service should be in this domain'}, 'entity': {'id': '1007360414114435072', 'name': 'Bitcoin cryptocurrency', 'description': 'Bitcoin Cryptocurrency'}}, 
{'domain': {'id': '30', 'name': 'Entities [Entity Service]', 'description': 'Entity Service top level domain, every item that is in Entity Service should be in this domain'}, 'entity': {'id': '1007361429752594432', 'name': 'Ethereum cryptocurrency', 'description': 'Ethereum Cryptocurrency'}}, 
{'domain': {'id': '47', 'name': 'Brand', 'description': 'Brands and Companies'}, 'entity': {'id': '1372588659346612225', 'name': 'Binance'}}, 
{'domain': {'id': '30', 'name': 'Entities [Entity Service]', 'description': 'Entity Service top level domain, every item that is in Entity Service should be in this domain'}, 'entity': {'id': '857879456773357569', 'name': 'Technology', 'description': 'Technology'}}, 
{'domain': {'id': '66', 'name': 'Interests and Hobbies Category', 'description': 'A grouping of interests and hobbies entities, like Novelty Food or Destinations'}, 'entity': {'id': '913142676819648512', 'name': 'Cryptocurrencies', 'description': 'Cryptocurrency'}}, 
{'domain': {'id': '30', 'name': 'Entities [Entity Service]', 'description': 'Entity Service top level domain, every item that is in Entity Service should be in this domain'}, 'entity': {'id': '1001503516555337728', 'name': 'Blockchain', 'description': 'Blockchain'}}, 
{'domain': {'id': '66', 'name': 'Interests and Hobbies Category', 'description': 'A grouping of interests and hobbies entities, like Novelty Food or Destinations'}, 'entity': {'id': '1369311988040355840', 'name': 'NFTs', 'description': 'Non-fungible tokens'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '781974596148793345', 'name': 'Business & finance'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '781974596794716162', 'name': 'Financial services'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '847894353708068864', 'name': 'Investing', 'description': 'Investing'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '848920371311001600', 'name': 'Technology', 'description': 'Technology and computing'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '913142676819648512', 'name': 'Cryptocurrencies', 'description': 'Cryptocurrency'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '1007360414114435072', 'name': 'Bitcoin cryptocurrency', 'description': 'Bitcoin Cryptocurrency'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '1007361429752594432', 'name': 'Ethereum cryptocurrency', 'description': 'Ethereum Cryptocurrency'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '1369311988040355840', 'name': 'NFTs', 'description': 'Non-fungible tokens'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '1390680741206368263', 'name': 'Cryptocurrency exchanges'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '1478776259068907541', 'name': 'Cryptotokens'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '1484181943616884743', 'name': 'Cryptocoins'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '1486271512655003652', 'name': 'Web3'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '1491481998862348291', 'name': 'Digital asset industry'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '1492162686204854274', 'name': 'Digital assets & cryptocurrency', 'description': 'Cryptocurrency'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '1521397643909365760', 'name': 'NFT development'}}, 
{'domain': {'id': '131', 'name': 'Unified Twitter Taxonomy', 'description': 'A taxonomy of user interests. '}, 'entity': {'id': '1536439027636678656', 'name': 'Decentralized finance'}}, 
{'domain': {'id': '174', 'name': 'Digital Assets & Crypto', 'description': 'For cryptocurrency entities'}, 'entity': {'id': '1007360414114435072', 'name': 'Bitcoin cryptocurrency', 'description': 'Bitcoin Cryptocurrency'}}, 
{'domain': {'id': '174', 'name': 'Digital Assets & Crypto', 'description': 'For cryptocurrency entities'}, 'entity': {'id': '1007361429752594432', 'name': 'Ethereum cryptocurrency', 'description': 'Ethereum Cryptocurrency'}}, 
{'domain': {'id': '174', 'name': 'Digital Assets & Crypto', 'description': 'For cryptocurrency entities'}, 'entity': {'id': '1478776259068907541', 'name': 'Cryptotokens'}}]

I tried this code, but the headers cannot be located this way.

import json
import re
import os
from tabulate import tabulate

file = open('binance_hash_tweets_micro.txt', 'r+')
read = file.readlines()
file.close()
modified = []   # empty list that the loop below fills with the lines of the file

for row in read:
    modified.append(row)

print(modified)

header = modified.pop(0)

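def fixed_length(text, length):
    # Truncate or pad text with spaces so it is exactly `length` characters wide.
    if len(text) > length:
        text = text[:length]
    elif len(text) < length:
        text = (text + " " * length)[:length]
    return text  # return for every case, not only when padding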

for column in header:
    print(fixed_length(column,20), end = "  ")
print()

If someone could help. I would appreciate. : )

Barmar
  • You don't need to parse this yourself. Just read the entire thing into a variable and call `ast.literal_eval()` to parse it into a list of dictionaries. – Barmar Dec 15 '22 at 17:54
  • "I then saved the scraped data to a txt file" Why did you do that? Why didn't you use a standard serialization format, like JSON? Or even `pickle`? Just dumping the string representation of a Python object to a text file **is not serialization**. – juanpa.arrivillaga Dec 15 '22 at 17:58
  • @juanpa.arrivillaga I will probably use the correct serialization method after this comment. But the problem would still exist: how to create headers from the keys on the first line, then create a table with those headers and fill it with the values from all the other lines. Whether JSON or txt, I figured that if I could learn the comprehension techniques in any data format, they would also apply to JSON, XML, etc. – throothewire Dec 15 '22 at 18:12
  • @Barmar thanks for the advice. From looking at the data, it looks like it is already a list of nested dictionaries. – throothewire Dec 15 '22 at 18:20
  • Yes, that's my point. If you do `file.write(str(list_of_dictionaries))` to create the file, you can use `ast.literal_eval(file.read())` to get back the original data. – Barmar Dec 15 '22 at 18:25
  • You can get the column headings with `list(data[0].keys())` – Barmar Dec 15 '22 at 18:29
  • @Barmar I used your code, replaced `list_of_dictionaries` with my text file, and used `ast.literal_eval(file.read())`, but `ast` is not defined. Is it from another Python library? – throothewire Dec 15 '22 at 18:34
  • @Barmar I tried this: `file = open('binance_hash_tweets_micro.txt', 'r+')` `read = file.readlines()` `#file.close()` `#modified = []` `list(read[0].keys('domain', 'id', 'name', 'entity', 'id', 'name', 'description'))` `print(read)` and got a traceback saying the object is not subscriptable. Have you tried running your solution with the data above? Did it work for you? – throothewire Dec 15 '22 at 18:42
  • `keys()` returns the list of keys in a dictionary, you don't give it arguments. – Barmar Dec 15 '22 at 20:01
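The comments above describe a round trip: text written with `str()` can be read back with `ast.literal_eval()`, and `keys()` is then called with no arguments on one of the resulting dictionaries. A minimal sketch of that idea, assuming a one-item sample list in place of the real file (`list_of_dictionaries` and `top_level_keys` are illustrative names):

```python
import ast  # part of the standard library, but it still has to be imported

# Hypothetical sample with the same shape as the data in the question.
list_of_dictionaries = [
    {'domain': {'id': '47', 'name': 'Brand', 'description': 'Brands and Companies'},
     'entity': {'id': '1372588659346612225', 'name': 'Binance'}},
]

# This is the text that ends up in the file when the list is written with str().
text = str(list_of_dictionaries)

# literal_eval safely evaluates a string that contains a Python literal.
data = ast.literal_eval(text)

# keys() takes no arguments; it reports the keys the dictionary already has.
top_level_keys = list(data[0].keys())
print(top_level_keys)  # ['domain', 'entity']
```

Note that `file.readlines()` returns a list of strings, and strings have no `keys()` method, so the text has to be parsed into dictionaries first.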

1 Answer


You don't need to parse it yourself; use `ast.literal_eval()` to parse it.

import ast

with open('binance_hash_tweets_micro.txt', 'r') as f:
    binance_list = ast.literal_eval(f.read())

first = binance_list[0]
header = ['domain_' + key for key in first['domain']] + ['entity_' + key for key in first['entity']]
print(header)

This will print

['domain_id', 'domain_name', 'domain_description', 'entity_id', 'entity_name', 'entity_description']
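The same prefixing idea can be extended from the header to every row. The following is a rough sketch rather than a definitive implementation: it assumes the two-level shape shown in the question, uses two entries copied from the question as inline sample text instead of reading the file, and collects headers from all rows because some entities (such as Binance) have no `description`. The helper names `flatten` and `fixed_length` and the empty-string placeholder are illustrative choices.

```python
import ast

# Two entries copied from the question; the first entity has no 'description'.
sample_text = """[{'domain': {'id': '47', 'name': 'Brand', 'description': 'Brands and Companies'}, 'entity': {'id': '1372588659346612225', 'name': 'Binance'}},
{'domain': {'id': '174', 'name': 'Digital Assets & Crypto', 'description': 'For cryptocurrency entities'}, 'entity': {'id': '1007360414114435072', 'name': 'Bitcoin cryptocurrency', 'description': 'Bitcoin Cryptocurrency'}}]"""


def flatten(entry):
    """Turn {'domain': {...}, 'entity': {...}} into a flat dict like {'domain_id': ...}."""
    return {f"{outer}_{inner}": value
            for outer, nested in entry.items()
            for inner, value in nested.items()}


def fixed_length(text, length):
    """Truncate or pad text with spaces so it is exactly `length` characters wide."""
    return text[:length].ljust(length)


data = ast.literal_eval(sample_text)
flat_rows = [flatten(entry) for entry in data]

# Collect headers from every row, in first-seen order, in case a row lacks a key.
header = []
for row in flat_rows:
    for key in row:
        if key not in header:
            header.append(key)

# Build the table; a missing value becomes an empty string.
table = [[row.get(column, "") for column in header] for row in flat_rows]

print("  ".join(fixed_length(column, 20) for column in header))
for line in table:
    print("  ".join(fixed_length(cell, 20) for cell in line))
```

With the real file, `sample_text` would be replaced by `f.read()` as in the code above. The `table` list of lists could also be passed to `tabulate(table, headers=header)` from the `tabulate` package that the question already imports.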
Barmar
  • I ran this code exactly as written and got a traceback: `line 4, in binance_list = ast.literal_eval(f.read())` / `Python311\Lib\ast.py", line 62, in literal_eval node_or_string = parse(node_or_string.lstrip(" \t"), mode='eval')` / `Python311\Lib\ast.py", line 50, in parse return compile(source, filename, mode, flags,` / `line 1 [{'domain': {'id': '46', 'name': 'Business Taxonomy', 'description': 'Categories within Brand Verticals that narrow down the scope of` – throothewire Dec 15 '22 at 20:43
  • I copied your text file and it worked fine for me. – Barmar Dec 15 '22 at 20:50
  • What is the error message at the end of the traceback? – Barmar Dec 15 '22 at 20:52
  • SyntaxError: unterminated string literal (detected at line 1) – throothewire Dec 15 '22 at 20:53
  • import ast with open('binance_hash_tweets_micro.txt', 'r') as f: binance_list = ast.literal_eval(f.read()) first = binance_list[0] header = ['domain_' + key for key in first['domain']] + ['entity_' + key for key in first['entity']] print(header) – throothewire Dec 15 '22 at 20:53
  • That means there's something wrong with the file, it has mismatched quotes. – Barmar Dec 15 '22 at 20:54
  • See https://stackoverflow.com/questions/70780266/unterminated-string-literal for the meaning of that error. – Barmar Dec 15 '22 at 20:57
  • I'm not going to spend any more time on this. Bite the bullet and use JSON or Pickle to save and restore the data. – Barmar Dec 15 '22 at 20:58
  • OK, you're right, the uploaded text is different. I copied the text from my question, did the same as you, and it worked. – throothewire Dec 15 '22 at 21:01
  • Any idea how to create a table from the rest of the data in the file? Thank you so much for your help, friend :) – throothewire Dec 15 '22 at 21:02
  • I'm not a pandas expert, but I think you should be able to use `pd.json_normalize()`. – Barmar Dec 15 '22 at 21:03
  • OK, I will look into it. Thank you for your time. Now I have to learn how to convert a txt file to JSON so I can run this JSON and pandas code. – throothewire Dec 15 '22 at 21:06
  • You don't convert the file to JSON directly. Read the file into a list using my code, then use `json.dump()` to write it to a JSON file. Then use JSON from then on. – Barmar Dec 15 '22 at 21:08
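The conversion described in the last comment can be sketched as follows, assuming a one-item inline sample in place of the text file and a temporary directory for the output (the `.json` file name is hypothetical):

```python
import ast
import json
import os
import tempfile

# Hypothetical stand-in for the contents of the text file.
text = ("[{'domain': {'id': '47', 'name': 'Brand', 'description': 'Brands and Companies'}, "
        "'entity': {'id': '1372588659346612225', 'name': 'Binance'}}]")

# Step 1: parse the Python-literal text into a list of dictionaries.
data = ast.literal_eval(text)

# Step 2: write the list out as real JSON (the path is an assumption for this sketch).
json_path = os.path.join(tempfile.mkdtemp(), "binance_hash_tweets_micro.json")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2)

# Step 3: from then on, load the JSON file instead of the text file.
with open(json_path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == data)  # True
```

Once the data is a list of dictionaries, `pd.json_normalize(restored)` from pandas, as suggested above, should produce one column per nested key, named with a dot separator by default (for example `domain.id`).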