Split txt file into multiple new files with regex

Question

I am calling on the collective wisdom of Stack Overflow because I am at my wits' end trying to figure out how to do this, and I'm a self-taught newbie coder.

I have a txt file of Letters to the Editor that I need to split into their own individual files.

The letters are all formatted in roughly the same way, with the author's name in capitals on the last line:

For once, before offering such generous but the unasked for advice, put yourselves in...

Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...

Why is it that The Times does not urge totalitarian Arab slates and terrorist...

PAUL STONEHILL Los Angeles

There you go again. Your editorial again makes groundless criticisms of the Israeli...

On Dec. 7 you called proportional representation “bizarre," despite its use in the...

Proportional representation distorts Israeli politics? Huh? If Israel changes the...

MATTHEW SHUGART Laguna Beach

Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...

Although the mayor did not support Proposition U (the slow-growth initiative) his...

If West Los Angeles is any indication of the no-growth policy, where do we go from here?

MARJORIE L. SCHWARTZ Los Angeles

I thought that the best way to go about it would be to try and use regex to identify the lines that started with a name that's all in capital letters since that's the only way to really tell where one letter ends and another begins.

I have tried quite a few different approaches, but nothing seems to work quite right. All the other answers I have seen are based on a repeatable line or word (for example, the answers posted at "how to split single txt file into multiple txt files by Python" and "Python read through file until match, read until next pattern"). None of them seem to work when I adjust them to accept my regex for all-capital words.

The closest I've managed to get is the code below. It creates the right number of files. But after the second file is created it all goes wrong. The third file is empty and in all the rest the text is all out of order and/or incomplete. Paragraphs that should be in file 4 are in file 5 or file 7 etc or missing entirely.

import re
thefile = raw_input('Filename to split: ')
name_occur = [] 
full_file = []
pattern = re.compile("^[A-Z]{4,}")

with open (thefile, 'rt') as in_file:
    for line in in_file:
        full_file.append(line)
        if pattern.search(line):
            name_occur.append(line) 

totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)","",thefile)

while letters <= totalFiles:
    f1 = open(thefile + '-' + str(letters) + ".txt", "a")
    doIHaveToCopyTheLine = False
    ignoreLines = False
    for line in full_file:
        if not ignoreLines:
            f1.write(line)
            full_file.remove(line)
        if pattern.search(line):
            doIHaveToCopyTheLine = True
            ignoreLines = True
    letters += 1
    f1.close()

I am open to completely scrapping this approach and doing it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I am the inexperienced newbie that I am if you are awesome enough to take your time to help me.

I'd suggest splitting the program in smaller functions, e.g.: "read file lines into list", "check whether the line should start a new file", "split list of lines into list of lists of lines, each list being content of the new file", "write list of lines into a file". Actually, the first and the last functions are already implemented in Python (`readlines` and `writelines` methods). — yeputons, Feb 07 '17 at 05:16
[Good reading about debugging](https://ericlippert.com/2014/03/05/how-to-debug-small-programs/). Say, I don't really understand the logic of your `while`/`for` loops at the end: what are their [invariants](https://en.wikipedia.org/wiki/Invariant_(computer_science)), i.e. the conditions that should hold before each iteration of each loop? A few more notes: the `doIHaveToCopyTheLine` variable is not used at all, and the `ignoreLines` variable can be replaced with a `break` statement. — yeputons, Feb 07 '17 at 05:19
@yeputons as to your first comment: that's what I thought I should do when I started this, but I don't know how. As to your second comment, I'm not sure what my loop is doing either...I'm cobbling code together as I go and encounter a new problem and trying to get it to work. So your confusion is my confusion as well. — Sasha Hoffman, Feb 07 '17 at 17:50
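
For readers following along, here is a minimal sketch of the decomposition suggested in the comments above. The function names (`is_author_line`, `group_letters`, `write_letters`, `split_file`) are hypothetical, and the pattern is the one from the question, so it inherits that pattern's limits (for example, a surname shorter than four letters at the start of the line would be missed). It also avoids calling `full_file.remove(line)` while iterating over `full_file`, which makes a `for` loop skip elements and is a likely cause of the misplaced paragraphs.

```python
import re

# The question's pattern: a line starting with four or more capital letters.
AUTHOR_LINE = re.compile(r"^[A-Z]{4,}")


def is_author_line(line):
    """Return True if the line looks like 'PAUL STONEHILL Los Angeles'."""
    return AUTHOR_LINE.search(line) is not None


def group_letters(lines):
    """Split lines into a list of letters; each letter ends with its author line."""
    letters, current = [], []
    for line in lines:
        current.append(line)
        if is_author_line(line):
            letters.append(current)
            current = []
    # Trailing text with no signature is kept as a final, unsigned chunk.
    if any(line.strip() for line in current):
        letters.append(current)
    return letters


def write_letters(letters, base_name):
    """Write each letter to base_name-1.txt, base_name-2.txt, and so on."""
    for index, letter in enumerate(letters, start=1):
        with open("{0}-{1}.txt".format(base_name, index), "w") as out:
            out.writelines(letter)


def split_file(path):
    """Read path, group its lines into letters, and write one file per letter."""
    with open(path) as in_file:
        lines = in_file.readlines()
    base_name = re.sub(r"\.txt$", "", path)
    write_letters(group_letters(lines), base_name)
```

Calling `split_file('letters.txt')` (a hypothetical filename) would then produce `letters-1.txt`, `letters-2.txt`, and so on. Because the original list is never modified while it is being read, each line ends up in exactly one output file.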

score 1 · Accepted Answer · answered Feb 07 '17 at 07:05

I took a simpler approach and avoided regex. The tactic here is essentially to count the uppercase letters in the first three words and make sure they pass certain logic. I went for first word is uppercase and either the second or third word is uppercase too, but you can adjust this if needed. This will then write each letter to new files with the same name as the original file (note: it assumes your file has an extension like .txt or such) but with an incremented integer appended. Try it out and see how it works for you.

import string

def split_letters(fullpath):
    current_letter = []
    letter_index = 1
    # Assumes the filename has an extension, e.g. "letters.txt".
    fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)

    with open(fullpath, 'r') as letters_file:
        letters = letters_file.readlines()
    for line in letters:
        words = line.split()
        # Reduce each word to just its uppercase ASCII letters.
        upper_words = []
        for word in words:
            upper_word = ''.join(
                c for c in word if c in string.ascii_uppercase)
            upper_words.append(upper_word)

        # A word counts as "uppercase" if it contains at least two capitals.
        len_upper_words = len(upper_words)
        first_word_upper = len_upper_words and len(upper_words[0]) > 1
        second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
        third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1
        if first_word_upper and (second_word_upper or third_word_upper):
            # Author line: it ends the current letter, so write the file now.
            current_letter.append(line)
            new_filename = '{0}{1}.{2}'.format(
                fullpath_base, letter_index, fullpath_ext)
            with open(new_filename, 'w') as new_letter:
                new_letter.writelines(current_letter)
            current_letter = []
            letter_index += 1

        else:
            # Ordinary line: keep collecting it for the current letter.
            current_letter.append(line)

I tested it on your sample input and it worked fine.

mVChr


I like this approach, but the sample data is too small to contain any interesting corner cases. What about "WILLIAM de GEER" or "e e cummings"? If there are no such problems in the real data, why not just check for the first two words being all uppercase and the last word being proper case and allowing the pattern to change only once in between? (This might again be easier with regex.) – tripleee Feb 07 '17 at 08:17
@tripleee That's why I said "you can adjust this if needed." The strategy is sound, but you'll never get a foolproof solution with data like this, best you can hope is to minimize errors. – mVChr Feb 07 '17 at 20:09
@mVChr your approach worked perfectly! And you are completely correct, the data is imperfect and no code will be able to account for every anomaly (especially since the text is OCR from a newspaper) but your code has allowed me to build in some safety nets that will help minimize the errors and get everything (mostly) separated appropriately. I don't need it to be perfect perfect, but I needed it to be mostly perfect. Which this does. Thank you SOOOO much. – Sasha Hoffman Feb 08 '17 at 00:29
@mVChr out of curiosity, any idea how this would be structured if the name was the _beginning_ of the letter and not the end? If I understand this correctly, the function loops through the lines _until_ it finds the capitalized name. But what if the capitalized name started the letter and the next capitalized name was the start of a new letter? – Sasha Hoffman Feb 10 '17 at 21:07
Next time I challenge you to work it out yourself, but I'll tell you this time. Notice that each `current_letter` is a list of lines that we append to and then write to a file. Here, when we find the author line, since it's the last line we append it to `current_letter`, write the file and create a new `current_letter = []`. If it were the first line, we'd write `current_letter` to a file if it wasn't empty, then create a new `current_letter = []` and append the author line to it to start the next letter. Best of luck to you in your journey learning to program! – mVChr Feb 11 '17 at 16:54
thanks! I'm slowly figuring out the logic of python and how it needs to be structured. Having you help me to "see" it both ways has really helped me understand how all this functions!!!! – Sasha Hoffman Feb 11 '17 at 20:42
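
The name-first variant described in the comment above can be sketched roughly as follows. This is only an illustration: `group_name_first` is a hypothetical name, and the default author-line test is the pattern from the question rather than the word-counting heuristic from this answer. It returns the letters as lists of lines instead of writing files, so the grouping logic is easy to check on its own.

```python
import re

# Assumption: the question's pattern stands in for the author-line test.
AUTHOR_LINE = re.compile(r"^[A-Z]{4,}")


def group_name_first(lines, is_author_line=AUTHOR_LINE.search):
    """Group lines into letters where the author line is the FIRST line."""
    letters, current = [], []
    for line in lines:
        if is_author_line(line):
            # An author line opens a new letter, so flush the previous one
            # first (unless nothing has been collected yet).
            if current:
                letters.append(current)
            current = []
        current.append(line)
    # The last letter has no following author line to trigger a flush.
    if current:
        letters.append(current)
    return letters
```

The only structural difference from the name-last version is the order of operations: the flush happens before the author line is appended rather than after, and one extra flush is needed once the loop ends.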

gregory · Answer 2 · 2017-02-08T17:24:13.507

While the other answer is suitable, you may still be curious about using a regex to split up a file.

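import re

# Author line: starts with a run of capitals, whitespace, or periods.
name_pattern = re.compile(r'^([A-Z\s\.]+\b)')

smallfile = None
buf = ""
with open('input_file.txt', 'rt') as f:
    for line in f:
        # Collect every line until the next author line is found.
        buf += line
        match = name_pattern.search(line)
        if match is not None:
            if smallfile:
                smallfile.close()
            # Name the output file after the author; strip() drops the
            # trailing space that the pattern captures.
            smallfile_name = '{}.txt'.format(match.group(1).strip())
            smallfile = open(smallfile_name, 'w')
            smallfile.write(buf)
            buf = ""
    if smallfile:
        smallfile.close()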
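
As a further illustration of the regex route, the whole file can also be read into one string and cut up with a single multi-line pattern. This is a sketch, not part of the answer's code: `split_with_regex` is a hypothetical name, and the author-line test reuses the question's `^[A-Z]{4,}` pattern. Each match runs lazily from the end of the previous letter up to and including the next author line.

```python
import re

# DOTALL lets '.' cross newlines; MULTILINE lets '^' match at each line start.
LETTER = re.compile(r".*?^[A-Z]{4,}[^\n]*\n?", re.DOTALL | re.MULTILINE)


def split_with_regex(text):
    """Return a list of letters, each ending with its author line."""
    return LETTER.findall(text)
```

Any text after the last author line is not matched and is therefore dropped, which may or may not be what you want for your data.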

edited Feb 08 '17 at 17:24

answered Feb 07 '17 at 08:02

gregory


I've never had a problem with finding the lines that had the names in it. My problem has always been how to isolate the paragraphs before and between those names to then write into new files. – Sasha Hoffman Feb 07 '17 at 17:51
@Sasha Hoffman, Ah, ok, I guess I was thrown off by: "It all seems to not work when I have to adjust it to accept my regex of all capital words." – gregory Feb 08 '17 at 02:59
sorry if I wasn't clear. I meant that any of the other answers that I had found, I couldn't figure out how to adjust them to work with the regex because they were written based on a static, repeatable word not a fluctuating pattern. – Sasha Hoffman Feb 08 '17 at 13:38
@SashaHoffman, ah gotcha. So, I was curious about this one and drafted an example, which I added to the answer. I guess I still believe using a regex here is a viable, if not more adaptable, approach. – gregory Feb 08 '17 at 17:20

score 1 · Answer 3 · answered Oct 27 '19 at 09:05

If you run on Linux, use csplit.

Otherwise, check out these two threads:

How can I split a text file into multiple text files using python?

How to match "anything up until this sequence of characters" in a regular expression?

Mohl

