Pandas/Python Reading TXT file that has structure like dictionary and sort them into dataframes

Question

Original title:Pandas/Python Drop columns base on syntax ':' existence in cell

Since the the solution went a totally different approach I changed the title

============

Consider a dataframe from txt file.

Data txt file: txt

I removed the unwanted lines that contains "Iteration"

import pandas as pd
data = pd.read_csv('results.txt', sep=" ", header=None)
data = data[~data[1].astype(str).str.startswith('Iteration')]

Now its like this: pd

I wanted to remove all the columns that contains ":" to result something like: {2, 7}{2, 7}{2, 7}{2, 7}{1, 2, 7}{1, 2, 6, 7} {1, 2, 6, 7}{1, 2, 3, 6, 7}{1, 2, 3, 6, 7} {1, 3, 4, 6, 7}{1, 3, 4, 5, 6} {3, 4, 5, 6, 7}{1, 3, 4, 5, 7} {1, 4, 5, 6, 7}, .......etc.

and to merge columns such as: "{2," " 7}," into one column "{2,7}" Then I can replace all the column names with timestamps (6:30, 7:00 etc) to make it a proper table like :

6:30 7:00 7:30 8:00 8:30 9:00

{2,7} {2,7} {2,7} {2,7} {1,2,7} {1,2,6,7}

{5,6} {5,6} {5,6} {5,6} {1,5,6} {1,4,5,6}

....

My first step now is to remove columns with ":" , I tried these:

data.loc[:, ~(data[0:122].astype(str).apply(lambda x: x.str.contains(':'))).any()]
data.loc[:, ~(data.contains(':')).any()]

obviously they won't work since contains() cannot be use on data frames, I am not very familier with pandas. I searched for many other answers and wasn't able to get it right. I wonder what is the correct way to remove columns with cells that contains ":"

Or if my approch full of unnecessary steps, its there a better approach to turn the txt file into a dataframe of

6:30 7:00 7:30 8:00 8:30 9:00

{2,7} {2,7} {2,7} {2,7} {1,2,7} {1,2,6,7}

{5,6} {5,6} {5,6} {5,6} {1,5,6,} {1,4,5,6}

like sort it when reading the txt file at first place and loop through lines and make dictionaries instead? '6:30': {2, 7} ....

Corralien · Accepted Answer · 2022-06-22T19:51:17.900

Preprocess your file before create dataframe:

import pandas as pd
import ast

data = []
with open('results.txt') as fp:
    for line in fp:
        if line.startswith('Shifts:'):
            data.append(ast.literal_eval(line.split(':', maxsplit=1)[1].strip()))
df = pd.DataFrame.from_dict(data)

Important update: BUG FIX

ast.literal_eval works well here but I can guarantee the order of the sets after evaluation

>>> ast.literal_eval('{8, 2, 3, 7}')
{2, 3, 7, 8}  # We don't want that!

Maybe you can use re.findall to get the same output:

import pandas as pd
import re

data = []
with open('results.txt') as fp:
    for line in fp:
        if line.startswith('Shifts:'):
            data.append(dict(re.findall(r"'([^']+)': (\{[^\}]+\})", line[1:-1])))
df = pd.DataFrame.from_dict(data)

The output still the same but the cell values are not set but str:

>>> df
       6:30    7:00    7:30    8:00       8:30          9:00          9:30  ...            15:30            16:00         16:30   17:00   17:30   18:00   18:30
0    {2, 7}  {2, 7}  {2, 7}  {2, 7}  {1, 2, 7}  {1, 2, 6, 7}  {1, 2, 6, 7}  ...  {1, 3, 4, 5, 8}  {1, 3, 4, 5, 8}  {8, 3, 4, 5}  {8, 5}  {8, 5}  {8, 5}  {8, 5}
1    {5, 6}  {5, 6}  {5, 6}  {5, 6}  {1, 5, 6}  {1, 4, 5, 6}  {1, 4, 5, 6}  ...  {1, 2, 3, 7, 8}  {1, 2, 3, 7, 8}  {8, 2, 3, 7}  {8, 3}  {8, 3}  {8, 3}  {8, 3}
2    {8, 1}  {8, 1}  {8, 1}  {8, 1}  {8, 1, 7}  {8, 1, 2, 7}  {8, 1, 2, 7}  ...  {3, 4, 5, 6, 7}  {3, 4, 5, 6, 7}  {3, 4, 5, 6}  {4, 5}  {4, 5}  {4, 5}  {4, 5}
3    {1, 6}  {1, 6}  {1, 6}  {1, 6}  {1, 5, 6}  {1, 4, 5, 6}  {1, 4, 5, 6}  ...  {2, 3, 5, 7, 8}  {2, 3, 5, 7, 8}  {8, 2, 3, 7}  {2, 7}  {2, 7}  {2, 7}  {2, 7}
4    {3, 4}  {3, 4}  {3, 4}  {3, 4}  {3, 4, 5}  {2, 3, 4, 5}  {2, 3, 4, 5}  ...  {1, 5, 6, 7, 8}  {1, 5, 6, 7, 8}  {8, 1, 6, 7}  {8, 7}  {8, 7}  {8, 7}  {8, 7}
..      ...     ...     ...     ...        ...           ...           ...  ...              ...              ...           ...     ...     ...     ...     ...
142  {5, 6}  {5, 6}  {5, 6}  {5, 6}  {5, 6, 7}  {2, 5, 6, 7}  {2, 5, 6, 7}  ...  {1, 3, 4, 7, 8}  {1, 3, 4, 7, 8}  {8, 1, 3, 4}  {1, 4}  {1, 4}  {1, 4}  {1, 4}
143  {5, 6}  {5, 6}  {5, 6}  {5, 6}  {3, 5, 6}  {2, 3, 5, 6}  {2, 3, 5, 6}  ...  {1, 3, 4, 7, 8}  {1, 3, 4, 7, 8}  {8, 1, 4, 7}  {8, 1}  {8, 1}  {8, 1}  {8, 1}
144  {1, 2}  {1, 2}  {1, 2}  {1, 2}  {1, 2, 5}  {8, 1, 2, 5}  {8, 1, 2, 5}  ...  {3, 4, 5, 6, 7}  {3, 4, 5, 6, 7}  {3, 4, 6, 7}  {4, 7}  {4, 7}  {4, 7}  {4, 7}
145  {8, 7}  {8, 7}  {8, 7}  {8, 7}  {8, 5, 7}  {8, 5, 6, 7}  {8, 5, 6, 7}  ...  {1, 2, 3, 4, 5}  {1, 2, 3, 4, 5}  {1, 2, 3, 4}  {2, 3}  {2, 3}  {2, 3}  {2, 3}
146  {1, 4}  {1, 4}  {1, 4}  {1, 4}  {1, 4, 5}  {1, 4, 5, 6}  {1, 4, 5, 6}  ...  {2, 3, 5, 7, 8}  {2, 3, 5, 7, 8}  {8, 2, 3, 7}  {8, 7}  {8, 7}  {8, 7}  {8, 7}

[147 rows x 25 columns]

Update 2:

The order of set after the evaluation of a string representation of a set is not guarantee except (???) for numbers between -5 and 256 (CPython only). Even if numbers are in this interval, set is an unordered collection

>>> print(eval('{8000, 2000, 3000, 7000}'))
{8000, 7000, 3000, 2000}

>>> print(ast.literal_eval('{8000, 2000, 3000, 7000}'))
{8000, 7000, 3000, 2000}

>>> print(eval('{8, 2, 3, 7}'))
{8, 2, 3, 7}

>>> print(ast.literal_eval('{8, 2, 3, 7}'))
{8, 2, 3, 7}

>>> print(ast.literal_eval('{8, 2000, 3000, 7}'))
{8, 7, 3000, 2000}

>>> print(eval('{8, 2000, 3000, 7}'))
{8, 2000, 3000, 7}

`ast.literal_eval` works here because you have a valid string representation of a dict. Sometime it's easier to preprocess data rather than using Pandas :) — Corralien, Jun 22 '22 at 06:33
That's a great counter write up, thank you for pointing out the differences. @Corralien — BeRT2me, Jun 22 '22 at 16:59
Thanks, I updated my output (bad copy/paste since I used `d` instead of `line`. — Corralien, Jun 22 '22 at 19:46

BeRT2me · Answer 2 · 2022-06-22T17:08:12.847

A set may not maintain order, but a list will, so let's change all the sets to lists before evaluating them:

import ast

data = []
with open('results.txt') as f:
    for line in f:
        if line.startswith('Shifts: '):
            line = line.strip()[9:-1]
            line = line.replace('{', '[').replace('}', ']')
            line = '{' + line + '}'
            data.append(ast.literal_eval(line))

df = pd.DataFrame(data)
print(df.head())

Output:

       6:30    7:00    7:30    8:00       8:30          9:00          9:30  ...            15:30            16:00         16:30   17:00   17:30   18:00   18:30
0    [2, 7]  [2, 7]  [2, 7]  [2, 7]  [1, 2, 7]  [1, 2, 6, 7]  [1, 2, 6, 7]  ...  [1, 3, 4, 5, 8]  [1, 3, 4, 5, 8]  [8, 3, 4, 5]  [8, 5]  [8, 5]  [8, 5]  [8, 5]
1    [5, 6]  [5, 6]  [5, 6]  [5, 6]  [1, 5, 6]  [1, 4, 5, 6]  [1, 4, 5, 6]  ...  [1, 2, 3, 7, 8]  [1, 2, 3, 7, 8]  [8, 2, 3, 7]  [8, 3]  [8, 3]  [8, 3]  [8, 3]
2    [8, 1]  [8, 1]  [8, 1]  [8, 1]  [8, 1, 7]  [8, 1, 2, 7]  [8, 1, 2, 7]  ...  [3, 4, 5, 6, 7]  [3, 4, 5, 6, 7]  [3, 4, 5, 6]  [4, 5]  [4, 5]  [4, 5]  [4, 5]
3    [1, 6]  [1, 6]  [1, 6]  [1, 6]  [1, 5, 6]  [1, 4, 5, 6]  [1, 4, 5, 6]  ...  [2, 3, 5, 7, 8]  [2, 3, 5, 7, 8]  [8, 2, 3, 7]  [2, 7]  [2, 7]  [2, 7]  [2, 7]
4    [3, 4]  [3, 4]  [3, 4]  [3, 4]  [3, 4, 5]  [2, 3, 4, 5]  [2, 3, 4, 5]  ...  [1, 5, 6, 7, 8]  [1, 5, 6, 7, 8]  [8, 1, 6, 7]  [8, 7]  [8, 7]  [8, 7]  [8, 7]
..      ...     ...     ...     ...        ...           ...           ...  ...              ...              ...           ...     ...     ...     ...     ...
142  [5, 6]  [5, 6]  [5, 6]  [5, 6]  [5, 6, 7]  [2, 5, 6, 7]  [2, 5, 6, 7]  ...  [1, 3, 4, 7, 8]  [1, 3, 4, 7, 8]  [8, 1, 3, 4]  [1, 4]  [1, 4]  [1, 4]  [1, 4]
143  [5, 6]  [5, 6]  [5, 6]  [5, 6]  [3, 5, 6]  [2, 3, 5, 6]  [2, 3, 5, 6]  ...  [1, 3, 4, 7, 8]  [1, 3, 4, 7, 8]  [8, 1, 4, 7]  [8, 1]  [8, 1]  [8, 1]  [8, 1]
144  [1, 2]  [1, 2]  [1, 2]  [1, 2]  [1, 2, 5]  [8, 1, 2, 5]  [8, 1, 2, 5]  ...  [3, 4, 5, 6, 7]  [3, 4, 5, 6, 7]  [3, 4, 6, 7]  [4, 7]  [4, 7]  [4, 7]  [4, 7]
145  [8, 7]  [8, 7]  [8, 7]  [8, 7]  [8, 5, 7]  [8, 5, 6, 7]  [8, 5, 6, 7]  ...  [1, 2, 3, 4, 5]  [1, 2, 3, 4, 5]  [1, 2, 3, 4]  [2, 3]  [2, 3]  [2, 3]  [2, 3]
146  [1, 4]  [1, 4]  [1, 4]  [1, 4]  [1, 4, 5]  [1, 4, 5, 6]  [1, 4, 5, 6]  ...  [2, 3, 5, 7, 8]  [2, 3, 5, 7, 8]  [8, 2, 3, 7]  [8, 7]  [8, 7]  [8, 7]  [8, 7]

[147 rows x 25 columns]

`eval` is dangerous and has the same problem than `ast.literal_eval`. If it works, it's because integers between -5 and 256 are [preallocated in memory](https://docs.python.org/3/c-api/long.html#c.PyLong_FromLong). `print(eval('{8000, 2000, 3000, 7000}'))` and `print(ast.literal_eval('{8000, 2000, 3000, 7000}'))` both return `{8000, 7000, 3000, 2000}` which it's not the OP wants. — Corralien, Jun 22 '22 at 13:21
Yes I could look it up, but why is `ast.literal_eval` any 'safer' than `eval`? @Corralien — BeRT2me, Jun 22 '22 at 16:55
@LenKazuma I've modified my answer to a way that will maintain your value's order, namely lists, while still letting you perform actions on them in a similar way you would with sets. If this no longer works well for your use case, I'd use one of Corralien's alternative options while keeping in mind the caveats they've pointed out. — BeRT2me, Jun 22 '22 at 17:20
`eval` is dangerous especially when you eval user input (this is not the case here). `literal_eval` evaluates expressions only not statements. You can read [this](https://stackoverflow.com/questions/15197673/using-pythons-eval-vs-ast-literal-eval) to know more. — Corralien, Jun 22 '22 at 19:50
In my limited knowledge base I know set is an unordered "it does not preserve the insertion order" as in this[link](https://stackoverflow.com/questions/9792664/converting-a-list-to-a-set-changes-element-order) So I get if I need the order I will do list instead. Thanks for pointing out. So for I think order will not effect my goal for this case, I'll try list for other datasets that need to maintain order. Thanks! — LenKazuma, Jun 23 '22 at 02:57

Pandas/Python Reading TXT file that has structure like dictionary and sort them into dataframes

2 Answers2