I have a single txt file that I would like to split into many files according to the *TEXT ID.

For example, the single txt file looks like this:

*TEXT 017 01/04/63 PAGE 020
THE ALLIES AFTER NASSAU IN DECEMBER 1960, THE U.S . FIRST
PROPOSED TO HELP NATO DEVELOP ITS OWN NUCLEAR STRIKE FORCE . BUT EUROPE.....
*TEXT 018 01/04/63 PAGE 021
RUSSIA WHO'S IN CHARGE HERE ? IT WAS IN 1954 THAT NIKITA
KHRUSHCHEV LAUNCHED HIS GRANDIOSE " VIRGIN LANDS " GAMBLE . PART OF THE.....
*TEXT 019 01/04/63 PAGE 021
BERLIN ONE LAST RUN HANS WEIDNER HAD BEEN HOPING FOR MONTHS TO
ESCAPE DRAB EAST GERMANY AND MAKE HIS WAY TO THE WEST . THE ODDS WERE
AGAINST HIM, FOR WEIDNER, 40, WAS A....

How can I split it into multiple txt files, one per *TEXT ID, named as follows?

filename:
TEXT017.txt

filename:
TEXT018.txt

filename:
TEXT019.txt
dd90p

2 Answers

Split the text on the header lines that mark the beginning of each new text ID:

import re

raw_string = """*TEXT 017 01/04/63 PAGE 020
THE ALLIES AFTER NASSAU IN DECEMBER 1960, THE U.S . FIRST
PROPOSED TO HELP NATO DEVELOP ITS OWN NUCLEAR STRIKE FORCE . BUT EUROPE.....
*TEXT 018 01/04/63 PAGE 021
RUSSIA WHO'S IN CHARGE HERE ? IT WAS IN 1954 THAT NIKITA
KHRUSHCHEV LAUNCHED HIS GRANDIOSE " VIRGIN LANDS " GAMBLE . PART OF THE.....
*TEXT 019 01/04/63 PAGE 021
BERLIN ONE LAST RUN HANS WEIDNER HAD BEEN HOPING FOR MONTHS TO
ESCAPE DRAB EAST GERMANY AND MAKE HIS WAY TO THE WEST . THE ODDS WERE
AGAINST HIM, FOR WEIDNER, 40, WAS A...."""

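# The capturing group makes re.split keep each matched header line.
split_string = re.split(r'(.*TEXT .*PAGE \d+)', raw_string)
# split_string[0] is the empty string before the first header, so skip it.
for item in split_string[1:]:
    print('------')
    print(item)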

------
*TEXT 017 01/04/63 PAGE 020
------

THE ALLIES AFTER NASSAU IN DECEMBER 1960, THE U.S . FIRST
PROPOSED TO HELP NATO DEVELOP ITS OWN NUCLEAR STRIKE FORCE . BUT EUROPE.....

------
*TEXT 018 01/04/63 PAGE 021
------

RUSSIA WHO'S IN CHARGE HERE ? IT WAS IN 1954 THAT NIKITA
KHRUSHCHEV LAUNCHED HIS GRANDIOSE " VIRGIN LANDS " GAMBLE . PART OF THE.....

------
*TEXT 019 01/04/63 PAGE 021
------

BERLIN ONE LAST RUN HANS WEIDNER HAD BEEN HOPING FOR MONTHS TO
ESCAPE DRAB EAST GERMANY AND MAKE HIS WAY TO THE WEST . THE ODDS WERE
AGAINST HIM, FOR WEIDNER, 40, WAS A....
n1c9
  • I mean saving the text "THE ALLIES AFTER NASSAU IN DECEMBER 1960, THE U.S . FIRST PROPOSED TO HELP NATO DEVELOP ITS OWN NUCLEAR STRIKE FORCE . BUT EUROPE....." to a file named "TEXT017.txt". – dd90p Nov 24 '16 at 05:17
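
The comment above asks for each body to be stored under a name built from its ID. Because the split pattern contains a capturing group, re.split returns headers and bodies alternately, with whatever precedes the first header as the first item. A minimal sketch of pairing them up, assuming every header starts a line with `*TEXT` followed by a number (the helper name `pair_blocks` and the shortened `sample` text are made up for illustration):

```python
import re

def pair_blocks(raw_string):
    """Map 'TEXT017.txt'-style filenames to the text that follows each header."""
    # The capturing group keeps the header lines in the split result.
    parts = re.split(r'^(\*TEXT .*)\n', raw_string, flags=re.MULTILINE)
    # parts[0] is the text before the first header (empty here);
    # after it, headers and bodies alternate.
    headers, bodies = parts[1::2], parts[2::2]
    result = {}
    for header, body in zip(headers, bodies):
        num = re.match(r'\*TEXT (\d+)', header).group(1)
        result['TEXT%s.txt' % num] = body.strip()
    return result

# Shortened stand-in for the real file contents.
sample = ("*TEXT 017 01/04/63 PAGE 020\nFIRST BODY\n"
          "*TEXT 018 01/04/63 PAGE 021\nSECOND\nBODY\n")
print(pair_blocks(sample))
```

Slicing with `[1::2]` and `[2::2]` avoids filtering out empty strings, so a header with an empty body still stays aligned with its own content.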

Inspired by @n1c9, I modified the approach and added the missing steps to make it complete:

import re

raw_string = """*TEXT 017 01/04/63 PAGE 020
THE ALLIES AFTER NASSAU IN DECEMBER 1960, THE U.S . FIRST
PROPOSED TO HELP NATO DEVELOP ITS OWN NUCLEAR STRIKE FORCE . BUT EUROPE.....
*TEXT 018 01/04/63 PAGE 021
RUSSIA WHO'S IN CHARGE HERE ? IT WAS IN 1954 THAT NIKITA
KHRUSHCHEV LAUNCHED HIS GRANDIOSE " VIRGIN LANDS " GAMBLE . PART OF THE.....
*TEXT 019 01/04/63 PAGE 021
BERLIN ONE LAST RUN HANS WEIDNER HAD BEEN HOPING FOR MONTHS TO
ESCAPE DRAB EAST GERMANY AND MAKE HIS WAY TO THE WEST . THE ODDS WERE
AGAINST HIM, FOR WEIDNER, 40, WAS A...."""

# Raw strings avoid invalid-escape warnings for \* and \d.
split_strings = re.split(r'\n?(\*TEXT .*)\n', raw_string)
blocks = [s for s in split_strings if s]  # drop empty strings

for i in range(0, len(blocks), 2):
    # extract `019` from `*TEXT 019 01/04/63 PAGE 021`
    num = re.search(r'TEXT (\d+)', blocks[i]).group(1)

    # save content to `TEXT019.txt`
    filename = 'TEXT%s.txt' % num
    content = blocks[i+1]
    with open(filename, 'w') as fp:
        fp.write(content)
Anyany Pan
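
Both answers work on a string held in memory, while the question starts from a file on disk. As an alternative to re.split, here is a hedged sketch that reads the source file line by line and switches output files whenever a `*TEXT` header appears. The function name `split_file` and its `out_dir` parameter are hypothetical, and it assumes header lines begin with `*TEXT` followed by the numeric ID:

```python
import os
import re

HEADER = re.compile(r'\*TEXT (\d+)')

def split_file(src_path, out_dir='.'):
    """Write each *TEXT block to out_dir/TEXT<id>.txt and return the paths."""
    written = []
    out = None
    try:
        with open(src_path) as src:
            for line in src:
                match = HEADER.match(line)
                if match:
                    # New header: close the previous file, open the next one.
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, 'TEXT%s.txt' % match.group(1))
                    out = open(path, 'w')
                    written.append(path)
                elif out is not None:
                    # Body line; lines before the first header are skipped.
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Calling `split_file('input.txt')`, where `input.txt` stands in for the real file name, would create `TEXT017.txt`, `TEXT018.txt`, and so on in the current directory. The header line itself is not copied into the output, and if two blocks share an ID the later one overwrites the earlier one.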