-1

I'm trying to transform a text file which looks like the following:

14/10/2019 13:00:19 | www.google.com | {"type":"click", "user":"root", "ip":"0.0.0.0"}
14/10/2019 13:02:19 | www.google.com | {"type":"click", "user":"root", "ip":"0.0.0.0"}
14/10/2019 13:05:19 | www.google.com | {"type":"click", "user":"root", "ip":"0.0.0.0"}

The real file has many more log rows. I need to convert it so that each row becomes a single JSON object like the following:

{"date_time": "2019-10-14 13:00:19", "url": "www.google.com","type":"click", "user":"root", "ip":"0.0.0.0"}

But I can't work out an obvious way to do this in Python. Any help is appreciated.

devnotdev
  • Welcome to StackOverflow! Why don't you add a header row with your field names to the file, load it into a pandas DataFrame, and convert it to JSON as described here: https://stackoverflow.com/questions/50384883/convert-pandas-dataframe-to-json-object-pandas – Stepan Novikov Oct 29 '19 at 18:00

3 Answers

1

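You could use the datetime and json modules. Open the file and iterate over its lines; you may need to adapt some parts of the code to your data.

For the format codes, see the "strftime() and strptime() Behavior" section of the datetime documentation.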
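
To make the format codes concrete, here is a small sketch of the round trip from the log's day-first timestamp to the layout asked for in the question. The variable names are illustrative only.

```python
from datetime import datetime

raw = "14/10/2019 13:00:19"

# %d/%m/%Y matches the day-first date used in the log file
parsed = datetime.strptime(raw, "%d/%m/%Y %H:%M:%S")

# strftime turns the datetime back into text, here in the requested layout
formatted = parsed.strftime("%Y-%m-%d %H:%M:%S")

print(formatted)  # 2019-10-14 13:00:19
```

Note that strptime does not tolerate surrounding whitespace, so the date field should be stripped before it is parsed.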

Working example:

import datetime
import json

in_text = """14/10/2019 13:00:19 | www.google.com | {"type":"click", "user":"root", "ip":"0.0.0.0"}
14/10/2019 13:02:19 | www.google.com | {"type":"click", "user":"root", "ip":"0.0.0.0"}
14/10/2019 13:05:19 | www.google.com | {"type":"click", "user":"root", "ip":"0.0.0.0"}"""

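item_list = []
for line in in_text.split("\n"):
    # maxsplit=2 keeps any "|" inside the JSON payload intact
    date, url, json_part = line.split("|", 2)
    parsed = datetime.datetime.strptime(date.strip(), "%d/%m/%Y %H:%M:%S")
    item = {
        # store the date as a string so the result is JSON serializable
        "date_time": parsed.strftime("%Y-%m-%d %H:%M:%S"),
        "url": url.strip(),
    }
    item.update(json.loads(json_part))
    item_list.append(item)

print(json.dumps(item_list))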

To read lines from a file:

with open("your/file/path.txt") as fh:
    for line in fh:
        # Copy the code from the above example.
        ...
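
Putting the two pieces together, one possible arrangement is a pair of small functions that accept any iterable of lines, so the same code works on an open file or on a list of strings. This is only a sketch: `parse_log_line` and `parse_log` are hypothetical names, and skipping blank lines is an assumption about what the real file may contain.

```python
import json
from datetime import datetime


def parse_log_line(line):
    """Turn 'date | url | {json}' into one flat dict (hypothetical helper)."""
    # maxsplit=2 so a "|" inside the JSON payload is not split
    date, url, payload = line.split("|", 2)
    parsed = datetime.strptime(date.strip(), "%d/%m/%Y %H:%M:%S")
    record = {
        "date_time": parsed.strftime("%Y-%m-%d %H:%M:%S"),
        "url": url.strip(),
    }
    record.update(json.loads(payload))
    return record


def parse_log(lines):
    """Parse any iterable of lines (an open file works), skipping blank ones."""
    return [parse_log_line(line) for line in lines if line.strip()]


sample = [
    '14/10/2019 13:00:19 | www.google.com | {"type":"click", "user":"root", "ip":"0.0.0.0"}\n',
    "\n",
    '14/10/2019 13:02:19 | www.google.com | {"type":"click", "user":"root", "ip":"0.0.0.0"}\n',
]
records = parse_log(sample)
print(json.dumps(records[0]))
```

With a real file, the call would be `parse_log(fh)` inside the `with open(...)` block.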
mpawlak
0

Use pandas:

  • This assumes your data, as described, is in a .txt file.
  • .to_json has various parameters to customize the final look of the JSON file.
  • Having the data in a dataframe has the advantage of allowing additional analysis.
  • The data has a number of issues that can easily be fixed:
    • No column names
    • Improper datetime format
    • Whitespace around the URL
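import json
import pandas as pd

# read data; json.loads parses column 2 without executing it as code (unlike eval)
df = pd.read_csv('test.txt', sep='|', header=None, converters={2: json.loads})

# convert column 0 to a datetime format; the explicit format keeps
# day-first dates such as 01/10/2019 from being read as month-first
df[0] = pd.to_datetime(df[0].str.strip(), format='%d/%m/%Y %H:%M:%S')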

# your data has whitespace around the url; remove it
df[1] = df[1].apply(lambda x: x.strip())

# make column 2 a separate dataframe
df2 = pd.DataFrame.from_dict(df[2].to_list())

# merge the two dataframes on the index
df3 = df.merge(df2, left_index=True, right_index=True, how='outer')

# drop old column 2
df3.drop(columns=[2], inplace=True)

# name column 0 and 1
df3.rename(columns={0: 'date_time', 1: 'url'}, inplace=True)

# dataframe view
          date_time               url   type  user       ip
2019-10-14 13:00:19   www.google.com   click  root  0.0.0.0
2019-10-14 13:02:19   www.google.com   click  root  0.0.0.0
2019-10-14 13:05:19   www.google.com   click  root  0.0.0.0

# save to a JSON file
df3.to_json('test3.json', orient='records', date_format='iso')

JSON file

[{
        "date_time": "2019-10-14T13:00:19.000Z",
        "url": "www.google.com",
        "type": "click",
        "user": "root",
        "ip": "0.0.0.0"
    }, {
        "date_time": "2019-10-14T13:02:19.000Z",
        "url": "www.google.com",
        "type": "click",
        "user": "root",
        "ip": "0.0.0.0"
    }, {
        "date_time": "2019-10-14T13:05:19.000Z",
        "url": "www.google.com",
        "type": "click",
        "user": "root",
        "ip": "0.0.0.0"
    }
]
Trenton McKinney
0
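import json
from datetime import datetime

def transform_to_json(row):
    # row is [date, url, json_payload]; json.loads also handles true/false/null
    d = json.loads(row[2])
    # convert "14/10/2019 13:00:19" to the requested "2019-10-14 13:00:19"
    parsed = datetime.strptime(row[0].strip(), "%d/%m/%Y %H:%M:%S")
    d["date_time"] = parsed.strftime("%Y-%m-%d %H:%M:%S")
    d["url"] = row[1].strip()

    return d


with open('example.txt', 'r') as file:
    # maxsplit=2 keeps any "|" inside the JSON payload intact; skip blank lines
    json_objs = [transform_to_json(row.split('|', 2)) for row in file if row.strip()]

single_json_result = json.dumps(json_objs)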
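
The sample output in the question shows one object per line rather than a list, so newline-delimited JSON (JSON Lines) may be what is wanted; that is a guess, not something the question states. A minimal sketch of both forms follows, with `records` as made-up stand-in data and `to_json_lines` as a hypothetical helper:

```python
import json

# Stand-in for the parsed rows; in practice this would be the list built above
records = [
    {"date_time": "2019-10-14 13:00:19", "url": "www.google.com", "type": "click"},
    {"date_time": "2019-10-14 13:02:19", "url": "www.google.com", "type": "click"},
]


def to_json_lines(rows):
    """Serialize each row on its own line (the JSON Lines convention)."""
    return "\n".join(json.dumps(row) for row in rows)


as_array = json.dumps(records)     # one JSON document: a list of objects
as_lines = to_json_lines(records)  # one JSON object per line

print(as_lines)
```

The array form can be read back with a single json.loads call, while the line-per-object form has to be parsed one line at a time, which can be convenient for large logs.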
Rithin Chalumuri