I want to convert a text file into JSON Lines format using Python. This needs to work for a text file of any length (in characters or words).

As an example, I want to convert the following text:

A lot of effort in classification tasks is placed on feature engineering and parameter optimization, and rightfully so. 

These steps are essential for building models with robust performance. However, all these efforts can be wasted if you choose to assess these models with the wrong evaluation metrics.

To this:

{"text": "A lot of effort in classification tasks is placed on feature engineering and parameter optimization, and rightfully so."}
{"text": "These steps are essential for building models with robust performance. However, all these efforts can be wasted if you choose to assess these models with the wrong evaluation metrics."}

I tried this:

text = ""
with open(text.txt", encoding="utf8") as f:
    for line in f:
        text = {"text": line}

But no luck.

johnadem
  • So you mean you want to iterate over your text lines, put each in a dictionary using `"text"` as the key, convert it to JSON and append it to a file? – mkrieger1 Dec 28 '21 at 01:12
  • Something like this could work. I'll need to save it as a .jsonl though – johnadem Dec 28 '21 at 01:13
  • Then use a filename ending in `.jsonl` when opening a file for writing. – mkrieger1 Dec 28 '21 at 01:14
  • So at which of these steps was there a problem? – mkrieger1 Dec 28 '21 at 01:15
  • I'm not sure how to iterate over the text lines as you've mentioned. – johnadem Dec 28 '21 at 01:17
  • How about https://stackoverflow.com/questions/8009882/how-to-read-a-large-file-line-by-line – mkrieger1 Dec 28 '21 at 01:19
  • I gave it a try but can't get it to work. I tried with open(file.txt", encoding="utf8") as f: for line in f: jline = {"text": lines} text = f.readlines() – johnadem Dec 28 '21 at 01:27
  • @mkrieger1 Having no luck so far. I tried with open(file.txt, encoding="utf8") as f: for line in f: jline = {"text": line} text = text.append(jline) – johnadem Dec 28 '21 at 01:43
  • The code as shown has a rather obvious syntax error. Is that your actual problem? Which parts do you know how to do, and which are you struggling with? – MisterMiyagi Dec 28 '21 at 10:24
  • @MisterMiyagi I'm struggling to convert each line in the text file to a dictionary with "text" as the key and the line of text as the value. I think I am iterating over the file with the for statement but don't know how to convert to the format I need within the loop. – johnadem Dec 28 '21 at 10:26
  • Has anyone encountered an existing toolchain that does this? I presume LLM training data generation runs into this question, so there should be a functional library for it. – Hakan Baba Apr 21 '23 at 05:21

2 Answers

The basic idea of your for loop was correct, but the line text = {"text": line} overwrites text on every iteration, so only the last line survives and nothing is ever written to a file. What you want is to collect all the lines in a list and then write them out as JSON.

Try the following:

import json

# Generate a list of dictionaries
lines = []
with open("text.txt", encoding="utf8") as f:
    for line in f.read().splitlines():
        if line:
            lines.append({"text": line})

# Convert to a list of JSON strings
json_lines = [json.dumps(entry) for entry in lines]

# Join lines and save to .jsonl file
json_data = "\n".join(json_lines)
with open("my_file.jsonl", "w", encoding="utf8") as f:
    f.write(json_data)

splitlines removes the trailing \n characters, and if line: skips blank lines.
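Since the question asks for something that works on a file of any length, here is a minimal sketch of a streaming variant that never holds the whole file in memory. The function name text_to_jsonl is made up for illustration, and calling strip() on each line is an assumption on my part: it also removes leading and trailing spaces, which matches the expected output in the question but differs slightly from splitlines.

```python
import json


def text_to_jsonl(src_path, dst_path):
    """Stream src_path line by line, writing one JSON object per non-blank line.

    Returns the number of records written. (Hypothetical helper, not part of
    any library.)
    """
    count = 0
    with open(src_path, encoding="utf8") as src, \
            open(dst_path, "w", encoding="utf8") as dst:
        for line in src:
            # Assumption: surrounding whitespace is not wanted in the output.
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # json.dumps handles quoting and escaping of the text.
            dst.write(json.dumps({"text": line}) + "\n")
            count += 1
    return count
```

With the filenames used above, this would be called as text_to_jsonl("text.txt", "my_file.jsonl"). Each record is written as soon as it is read, so memory use stays roughly constant regardless of file size.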

ljdyer

A hacky way of doing this is to paste the text file into a CSV, one line of text per row. Make sure the first cell contains the word text so that pandas uses it as the column name, then use this code:

import pandas as pd

# knowledge is the path to the input CSV; knowledge_jsonl is the output .jsonl path
df = pd.read_csv(knowledge)
df.to_json(knowledge_jsonl,
           orient="records",
           lines=True)

Not ideal, but it works.

johnadem