
I am putting three fields, author, epoch_date and text, into one string, textData. They are separated by newlines. Of course, text can contain multiple lines or special characters (#, spaces, etc.).

textData="author:john"+"\n"+"epoch_date:53636352"+"\n"+"text:comment on_the_first_line \n line_in_the_middle \n line_at_the_end"

Now I am looking for a convenient way to extract that data into separate fields. I do not see how to do it with splitlines(), since it gives me 5 lines instead of 3 because, as mentioned, the text field can contain multiple lines.

for line in textData.splitlines():
    print(line)

Also, if there is a better way to define textData, I can modify that part as well. Note: Python 2.7 must be used, unfortunately.

Thanks

vel
  • Since you are specifying the delimiter, perhaps use a delimiter which will not already exist in the text. (It appears as though the 'problem' you are trying to solve, is self-created ...) – S3DEV Jul 14 '22 at 15:10
  • If `splitlines` is problematic, why did you ***choose*** to use `'\n'` as the separator? Why are you even merging the "fields" into one long string? Why not a list? Or nothing at all... – Tomerikoo Jul 14 '22 at 15:10
  • What are you doing with those fields and with `textData`? There are many ways to go about this so some context will help to find the optimal solution. This is likely an XY-problem, where you ask about what you think is a solution to a problem, instead of asking about ***the problem*** – Tomerikoo Jul 14 '22 at 15:13
  • @S3DEV "use a delimiter which will not exist in the text". That is the issue: I CANNOT KNOW what will not exist in the text... I have around 500,000 texts that I need to process – vel Jul 14 '22 at 15:21
  • @Tomerikoo thanks for your assistance. At this stage I need to save 3 fields; the constraint is that they must be saved into a single string field. In a later phase I need to extract those three values from that string into separate fields. Thanks a lot... – vel Jul 14 '22 at 15:22
  • @AndreyS - Ok, calm down. So you've chosen a delimiter which you *know* exists in the text. Got it. A common delimiter which is generally not found in text is the pipe character. For example: `|` – S3DEV Jul 14 '22 at 15:24
  • @S3DEV thanks, but I can't be sure that `|` does not exist in 500,000 string values. I intentionally put the text at the end of this string field to make things easier, because I know that it can contain \n if I want to separate rows on that character... I hope that clarifies it – vel Jul 14 '22 at 15:26
  • If you really must, just come up with an obscure separator that will never occur in a text, like `'*--^_^--*'`. Then you can always do `textData.split('*--^_^--*')`... – Tomerikoo Jul 14 '22 at 15:33
  • Does this answer your question? [Best method of saving data](https://stackoverflow.com/q/14509269/6045800) – Tomerikoo Jul 14 '22 at 16:11

1 Answer


Instead of splitlines, which splits at every line break, split manually on \n with the additional maxsplit parameter (a maxsplit of 2 yields at most three parts), then split each part on :, again with maxsplit, since there could be a : in the comment.

>>> textData="author:john"+"\n"+"epoch_date:53636352"+"\n"+"text:comment on_the_first_line \n line_in_the_middle \n line_at_the_end"
>>> [part.split(":", 1) for part in textData.split("\n", 2)]
[['author', 'john'], ['epoch_date', '53636352'], ['text', 'comment on_the_first_line \n line_in_the_middle \n line_at_the_end']]

Unlike splitlines, this does not handle all the different kinds of line ends, but if your data always comes from the same source, this might not be a problem (or it can be solved in preprocessing with replace or similar).
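As a rough sketch of the approach above, the two splits and the replace-based preprocessing can be wrapped in one helper. The name parse_record and the choice to normalize "\r\n" and "\r" are assumptions for illustration, not something the question specifies; the code should run unchanged on Python 2.7 and 3.

```python
def parse_record(text_data):
    """Return the three fields as a dict; only the first two newlines act as separators."""
    # Preprocessing (an assumption about the input): map "\r\n" and "\r" to "\n".
    # Note that this also rewrites line ends inside the text field itself.
    normalized = text_data.replace("\r\n", "\n").replace("\r", "\n")
    # At most two splits on "\n" give three parts; one split on ":" gives key and value.
    return dict(part.split(":", 1) for part in normalized.split("\n", 2))


text_data = "author:john\nepoch_date:53636352\ntext:first: line \n middle \n end"
fields = parse_record(text_data)
print(fields["author"])
print(repr(fields["text"]))
```

This relies on the text field always coming last, as in the question. A part without a : raises ValueError, which may or may not be the behaviour you want for malformed records.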


Alternatively, if you actually control the data, but have to convert it to a single string at some point, and then parse it back, consider changing the format, e.g. using JSON:

>>> import json
>>> d = {'text': 'comment on_the_first_line \n line_in_the_middle \n line_at_the_end', 'epoch_date': '53636352', 'author': 'john'}
>>> s = json.dumps(d)
>>> s
'{"text": "comment on_the_first_line \\n line_in_the_middle \\n line_at_the_end", "epoch_date": "53636352", "author": "john"}'
>>> json.loads(s)
{u'text': u'comment on_the_first_line \n line_in_the_middle \n line_at_the_end', u'epoch_date': u'53636352', u'author': u'john'}
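If the JSON route fits, a pair of small helpers keeps the packing and unpacking in one place. This is only a sketch: pack and unpack are hypothetical names, and the sample values are made up to show that newlines, | and quotes in the text survive the round trip.

```python
import json


def pack(author, epoch_date, text):
    """Serialize the three fields into a single string."""
    return json.dumps({"author": author, "epoch_date": epoch_date, "text": text})


def unpack(packed):
    """Recover the three fields from a string produced by pack()."""
    d = json.loads(packed)
    return d["author"], d["epoch_date"], d["text"]


packed = pack("john", "53636352", 'first | line \n "middle" \n end')
author, epoch_date, text = unpack(packed)
print(author)
print(repr(text))
```

No delimiter has to be chosen here, because json.dumps escapes newlines and quotes inside the values. On Python 2.7 the unpacked values come back as unicode objects, as in the output above.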
tobias_k
  • thank you. This is exactly what I needed. I can control the format in which I save it; the only constraint is that it must be saved in a variable of type string, from which I must later extract each field. Thanks a lot, accepting the answer – vel Jul 14 '22 at 16:01