1

Python
I want to split a string into parts of at most 5000 characters each. (The split must not fall inside a word; it should happen only at a space.)
I iterate through the string character by character and cut off a part every 4980 characters; if what remains at the end is shorter than 4980 characters, I translate that too. I am new to Python, so I'm sure my method is a mess: it works, but it certainly isn't good code.
I haven't checked for spaces in the string because Japanese and Chinese text has none, but this would need to be checked too so that a word isn't split into two parts.

with open('lightnovel.txt', 'r', encoding="utf8") as f:
    file = f.read()

# trans() is my translation function, defined elsewhere.
db = 0            # number of characters collected in the current part
partofbook = ''
for character in file:
    db = db + 1
    partofbook = partofbook + character
    if db >= 4980:
        # The current part is full: translate it and start a new one.
        trans(partofbook)
        db = 0
        partofbook = ''
# Translate whatever is left over (shorter than 4980 characters).
if partofbook:
    trans(partofbook)
Dani Suba
  • Why don't you start at index 5000, iterate backwards till you find whitespace at position A, let's say, then your first output is string[0,A-1]. Then jump ahead to index A+5000 and do the same thing, searching backwards for whitespace, found at index B, so your next output is string[A, B-1]. Repeat until done. Obviously check that you don't skip beyond len(string). – jarmod Mar 01 '21 at 18:52
  • Thank you, this is a great idea! – Dani Suba Mar 01 '21 at 19:01
  • Can you post this comment as an answer so I can check it as a solution? – Dani Suba Mar 01 '21 at 19:05
  • Yes, see [How to get char from string by index?](https://stackoverflow.com/questions/8848294/how-to-get-char-from-string-by-index) and [How do I get a substring of a string in Python?](https://stackoverflow.com/questions/663171/how-do-i-get-a-substring-of-a-string-in-python) – jarmod Mar 01 '21 at 19:06

2 Answers

1

I'm also new to Python, so I apologise for not implementing this into your code.

There is a string method called split(), used as string.split() (where string is the text you want to split).

Called without arguments, this method splits only at whitespace, so words are never cut in two.
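One way to combine split() with a length limit is to split into words first and then pack the words back together greedily. The following is only a minimal sketch under some assumptions: `pack_words` and `max_len` are hypothetical names, words are rejoined with single spaces (so the original line breaks are lost), and a single word longer than the limit is kept whole, which means this does not help with Japanese or Chinese text that has no spaces.

```python
def pack_words(text, max_len=5000):
    """Group whitespace-separated words into chunks of at most max_len characters.

    Hypothetical helper for illustration. A single word longer than
    max_len is kept whole, so such a chunk can exceed the limit.
    """
    chunks = []
    current = ""
    for word in text.split():
        # Try to append the next word to the chunk being built.
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_len:
            current = candidate
        else:
            # The word does not fit: close the current chunk, start a new one.
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each element of the returned list could then be passed to the translation function in turn.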

MSS98
  • The problem is that this doesn't split by length but at occurrences of a separator. As a w3schools example shows, apple#banana#orange gives apple, banana, and orange in a list if we choose to split by "#". I haven't found a way to use this function with a length parameter. – Dani Suba Mar 01 '21 at 19:00
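On splitting by length rather than by separator: the standard library's textwrap.wrap() may be worth a look, since it breaks text at whitespace while keeping each piece within a given width. The sketch below is an assumption about how it could be applied here, not something proposed in this thread; `wrap_chunks` and `max_len` are hypothetical names. With break_long_words=True, a run of characters without spaces (as in Japanese or Chinese) is cut at the width instead of overflowing it.

```python
import textwrap


def wrap_chunks(text, max_len=5000):
    """Split text into pieces of at most max_len characters using textwrap.

    Hypothetical helper for illustration. replace_whitespace=False keeps
    newlines inside a piece instead of turning them into spaces.
    """
    return textwrap.wrap(
        text,
        width=max_len,
        break_long_words=True,     # cut runs without spaces at the limit
        replace_whitespace=False,  # keep newlines as they are
    )
```

Note that whitespace at the start and end of each piece is dropped by default, so joining the pieces back together does not necessarily reproduce the original text exactly.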
0

I would start at index 5000 and iterate backwards until you find whitespace at, say, position A; your first output is then string[0, A-1] (in Python, you can use s[0:A] to get this substring).

Then jump ahead to index A+5000 and do the same thing, searching backwards for whitespace, found at index B, so your next output is string[A+1, B-1] (in Python you can use s[A+1:B] to get this substring). Note: it's A+1 because you want to skip the whitespace found at index A.

Repeat until done. Make sure you don't index beyond len(string), and decide what to do when no whitespace is found within the 5000 characters (as can happen with Japanese or Chinese text).
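The steps above can be sketched roughly as follows. This is a minimal illustration under some assumptions, not a definitive implementation: `split_on_whitespace` and `max_len` are hypothetical names, the search for the next chunk starts right after the whitespace that was skipped, and when no whitespace is found in the window the text is simply cut at the limit.

```python
def split_on_whitespace(text, max_len=5000):
    """Split text into chunks of at most max_len characters.

    Hypothetical helper for illustration. Cuts at the last whitespace
    inside each window; falls back to a hard cut if there is none.
    """
    chunks = []
    start = 0
    while len(text) - start > max_len:
        # Search backwards from the end of the window for whitespace.
        cut = start + max_len
        while cut > start and not text[cut].isspace():
            cut -= 1
        if cut == start:
            # No usable whitespace in the window: hard cut at the limit.
            cut = start + max_len
            chunks.append(text[start:cut])
            start = cut
        else:
            chunks.append(text[start:cut])
            start = cut + 1  # skip the whitespace we split on
    # Whatever remains fits within the limit.
    if start < len(text):
        chunks.append(text[start:])
    return chunks
```

For example, `split_on_whitespace("hello world foo", max_len=8)` yields the three words as separate chunks, because no two of them fit into 8 characters together.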

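Also, see [How to get char from string by index?](https://stackoverflow.com/questions/8848294/how-to-get-char-from-string-by-index) and [How do I get a substring of a string in Python?](https://stackoverflow.com/questions/663171/how-do-i-get-a-substring-of-a-string-in-python)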

jarmod