2

I have a file of sms messages in plain text that I want to convert to a CSV file. The format is as follows:

Sent on 1/1/2023 7:30:33 AM to Person1

Message

-----

Received on 5/20/2023 4:55:33 PM from Person1

Message

I want to loop through the text file, retrieve the lines and create a result like the one below.

Status Date Contact Message
Sent 1/1/2023 7:30:33 AM Person1 Message
Received 5/20/2023 4:55:33 PM Person1 Message

I started with the code below, but I'm pretty new to Python and can't figure out how to transpose the lines into columns.

import csv
import openpyxl

input_file = 'output.txt'
output_file = 'allmessages.csv'

wb = openpyxl.Workbook()
ws = wb.worksheets[0]

with open(input_file, 'r') as data:
    reader = csv.reader(data, delimiter='\t')
    for row in reader:
        ws.append(row)
        
wb.save(output_file)

Any suggestions would be greatly appreciated!

2 Answers2

1

Let me introduce you to the wonderful world of regular expressions!

I threw together this:

regexp = r"(Sent|Received).*?(\d.*?(?:AM|PM))\s(?:to\s|from\s)(\w*?)\n\n([\s\S]*?)(?:\n\-{5}|$)[\S\s]*?"

You can try it here: Regex101

This expression returns 4 groups per match, with the groups containting the status, timestamp, recipient, and message respectfully

I suggest you take a look at python's re package for how to extract this info exactly, but here's a post with a bit more info: How can i find all matches to a regular expression in Python

To explain the expression above a little:

(Sent|Received)
// A group matching either the literal "Sent" or "Received"
.*?
// Consume everything until the next match, but be careful not to skip anything
(\d.*?(?:AM|PM))
// A group containing anything between a digit and either the literal "AM" or "PM". 1234AM would match here, but so would 5/20/2023 4:55:33 PM, so as long as the format is consistent it'll work.
\s
// Consume a single space
(?:to\s|from\s)
// Consume either the literal "to " or "from "
(\w*?)
// A group containin sequantial alphanumeric characters without skipping the next match
\n\n
// Consume 2 newlines
([\s\S]*?)
// A group containing literally anything, but making sure it doesn't skip the next match
(?:\n\-{5}|$)
// Consume either the literal "\n-----" or the end of the entire file
[\S\s]*?
// Consumes literally anything until the start of the next match
Jam
  • 476
  • 3
  • 9
1

If your data is very consistent like you showed, then @Jam's regexp will probably do.

If you find your data less than normal and the regexp failing, you'll probably want a little state machine that understands the structure of the input TXT:

  • blank lines can be skipped
  • "-----" is the message boundary, the point in the text where everything that directly precedes makes up a complete output row
  • "Sent" or "Received" are a header, and mark the beginning of a message
  • every other kind of line should be considered to be part of the message
import csv
import re


def is_header(line: str) -> bool:
    if line.startswith("Sent on"):
        return True
    if line.startswith("Received on"):
        return True
    return False


def get_header(line: str) -> list[str]:
    # you might be okay with line.strip().split('\t')
    fields = re.split(r"\s+", line.strip())

    # Sent on 5/20/2023 4:55:33 PM from Person1
    # 0    1  2         3       4  5    6
    status = fields[0]
    date = " ".join(fields[2:5])
    contact = fields[6]

    return [status, date, contact]


def is_boundary(line: str) -> bool:
    return line.strip() == "-----"


# Initialize rows w/CSV header
rows = [
    ["Status", "Date", "Contact", "Message"],
]

with open("input.txt") as f:
    header: list[str] = []
    msg = ""

    for line in f:
        # Skip blank lines
        if not line.strip():
            continue

        # Flush any data, reset
        if is_boundary(line) and header and msg:
            rows.append(header + [msg])
            header = []
            msg = ""
            continue

        if is_header(line):
            header = get_header(line)
            continue

        msg += line


# Flush any trailing data
if header and msg:
    rows.append(header + [msg])


with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

You'll see a lot of line.strip() because I want to preserve the whitespace when line represents a part of the message, and that's assuming you can have multi-line messages, like this:

Sent on 1/1/2023 7:30:33 AM to Person1

Message1 line1

-----

Received on 5/20/2023 4:55:33 PM from Person1

Message2 line1
Message2 line2
-----
Sent on 2/9/2023 5:20:12 AM to Person2

Message1 line1
Message1 line2
Message1 line3

-----
Received on 7/4/2023 9:04:00 PM from Person2


Message2 line1
Message2 line2


-----

If your data is as normal as your example, you might like one line = line.strip() at the top of the loop. And then remove all the other .strip() calls.

I made that input TXT especially messy to highlight how flexible the parser is. From that input, it produces this CSV:

+----------+----------------------+---------+----------------+
| Status   | Date                 | Contact | Message        |
+----------+----------------------+---------+----------------+
| Sent     | 1/1/2023 7:30:33 AM  | Person1 | Message1 line1 |
|          |                      |         |                |
+----------+----------------------+---------+----------------+
| Received | 5/20/2023 4:55:33 PM | Person1 | Message2 line1 |
|          |                      |         | Message2 line2 |
|          |                      |         |                |
+----------+----------------------+---------+----------------+
| Sent     | 2/9/2023 5:20:12 AM  | Person2 | Message1 line1 |
|          |                      |         | Message1 line2 |
|          |                      |         | Message1 line3 |
|          |                      |         |                |
+----------+----------------------+---------+----------------+
| Received | 7/4/2023 9:04:00 PM  | Person2 | Message2 line1 |
|          |                      |         | Message2 line2 |
|          |                      |         |                |
+----------+----------------------+---------+----------------+
Zach Young
  • 10,137
  • 4
  • 32
  • 53