
I'm in need of some knowledge on how to fix an error I have made while collecting data. The collected data has the following structure:

["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]

I normally wouldn't have added "[" or "]" to .txt file when writing the data to it, line per line. However, the mistake was made and thus when loading the file it will separate it the following way:

(screenshot of the mis-parsed pandas dataframe omitted)

Is there a way to load the data properly to pandas?


5 Answers


On the snippet that I can cut and paste from the question (which I named test.txt), I could successfully read a dataframe via

  1. Purging the square brackets (with sed on a Linux command line, but this can also be done with a text editor, or in Python if need be)

    sed -i 's/^\[//g' test.txt # remove left square brackets assuming they are at the beginning of the line
    sed -i 's/\]$//g' test.txt # remove right square brackets assuming they are at the end of the line
    
  2. Loading the dataframe (in a python console)

    import pandas as pd
    pd.read_csv("test.txt", skipinitialspace = True, quotechar='"')
    

(not sure that this will work for the entirety of your file though).
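The sed purge in step 1 can also be sketched in plain Python (a minimal sketch, assuming Python 3.9+ for str.removeprefix/str.removesuffix; here the sample lines from the question are first written to test.txt so the snippet is self-contained):

```python
# Recreate the sample file from the question, then strip one leading "["
# and one trailing "]" from every line -- the pure-Python equivalent of
# the two sed commands above.
sample = '''["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]'''

with open("test.txt", "w") as f:
    f.write(sample)

with open("test.txt") as f:
    cleaned = [line.removeprefix("[").removesuffix("]")
               for line in f.read().splitlines()]

with open("test.txt", "w") as f:
    f.write("\n".join(cleaned) + "\n")
```

After this, the read_csv call in step 2 applies unchanged.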

Davide Fiocco

Consider the code below, which reads the text in myfile.txt; the file looks like this:

["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words ,it's basically creating a mini tornado."]

The code below removes [ and ] from the text, then splits every line on , (skipping the first line, which holds the headers). Some messages contain commas themselves, which would otherwise spill into an extra column (NaN elsewhere), so everything after the first comma is joined back into a single string, as intended. Code:

import pandas as pd

with open('myfile.txt', 'r') as my_file:
    text = my_file.read()
    text = text.replace("[", "")
    text = text.replace("]", "")

# Skip the header line and drop any empty trailing line.
lines = [i for i in text.split('\n')[1:] if i]

df = pd.DataFrame({
    'Author': [i.split(',')[0] for i in lines],
    'Message': [''.join(i.split(',')[1:]) for i in lines]
}).applymap(lambda x: x.replace('"', ''))

Output:

    Author                             Message
0   littleblackcat    There's a lot of redditors here that live in the area maybe/hopefully someone saw something. 
1   Kruse             In other words it's basically creating a mini tornado.
harvpan
  • Thank you for your response Harv, I have considered this option however my .txt file contains between 3-6 million lines. Hence it is not a feasible solution. – Sahra Levre May 14 '18 at 16:03

A pure pandas option is to change the separator from , to ", " in order to have only 2 columns, and then strip the unwanted characters, which to my understanding are [, ], " and space:

import pandas as pd
import io
string = '''
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''

df = pd.read_csv(io.StringIO(string),sep='\", \"', engine='python').apply(lambda x: x.str.strip('[\"] '))
# the \" instead of simply " is to make sure python does not interpret it as an end of string character
df.columns = [df.columns[0][2:],df.columns[1][:-2]]

print(df)
# Output (note the space before "There's" is also gone)
#            Author                                            Message
# 0  littleblackcat  There's a lot of redditors here that live in t...
# 1           Kruse  In other words, it's basically creating a mini...
OriolAbril

Here are a few more options to add to the mix:

  1. You could parse the lines yourself using ast.literal_eval, and then load them into a pd.DataFrame directly using an iterator over the lines:

    import pandas as pd
    import ast
    with open('data', 'r') as f:
        lines = (ast.literal_eval(line) for line in f)
        header = next(lines)
        df = pd.DataFrame(lines, columns=header)
        print(df)
    

    Note, however, that calling ast.literal_eval once for each line may not be very fast, especially if your data file has a lot of lines. If the data file is not too big, though, this may be an acceptable, simple solution.

  2. Another option is to wrap an arbitrary iterator (which yields bytes) in an IterStream. This very general tool (thanks to Mechanical snail) allows you to manipulate the contents of any file and then re-package it into a file-like object. Thus, you can fix the contents of the file, and yet still pass it to any function which expects a file-like object, such as pd.read_csv. (Note: I've answered a similar question using the same tool, here.)

    import io
    import pandas as pd
    
    def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
        """
        http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
        Lets you use an iterable (e.g. a generator) that yields bytestrings as a
        read-only input stream.
    
        The stream implements Python 3's newer I/O API (available in Python 2's io
        module).
    
        For efficiency, the stream is buffered.
        """
        class IterStream(io.RawIOBase):
            def __init__(self):
                self.leftover = None
            def readable(self):
                return True
            def readinto(self, b):
                try:
                    l = len(b)  # We're supposed to return at most this much
                    chunk = self.leftover or next(iterable)
                    output, self.leftover = chunk[:l], chunk[l:]
                    b[:len(output)] = output
                    return len(output)
                except StopIteration:
                    return 0    # indicate EOF
        return io.BufferedReader(IterStream(), buffer_size=buffer_size)
    
    def clean(f):
        for line in f:
            yield line.strip()[1:-1]+b'\n'
    
    with open('data', 'rb') as f:
        # https://stackoverflow.com/a/50334183/190597 (Davide Fiocco)
        df = pd.read_csv(iterstream(clean(f)), skipinitialspace=True, quotechar='"')
        print(df)
    
unutbu

For now the following solution was found:

sep = '[|"|]'

Using a multi-character separator allowed the brackets to be stored in separate columns of the pandas dataframe, which were then dropped. This avoids having to strip the words line by line.
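As a sketch of what that looks like end to end (note that sep='[|"|]' is read by pandas as a regex character class, so it actually splits on every double quote, which is what leaves the stray brackets isolated in their own columns; the sample data from the question is used here):

```python
import io

import pandas as pd

# Sample with the stray brackets, as in the question.
raw = '''["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]'''

# sep='[|"|]' is a regex character class matching '"' (and '|'), so the
# python engine splits each line at the quotes; the [ and ] characters
# end up isolated in throwaway columns.
df = pd.read_csv(io.StringIO(raw), sep='[|"|]', engine='python')

# Drop the bracket/comma debris columns and keep the real data.
df = df[['Author', 'Message']]
print(df)
```

Messages that themselves contain a double quote would still break this, since regex separators ignore quoting.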

  • pandas has some string methods implemented for Series objects. Thus, words can be stripped one whole column at a time with the apply method – OriolAbril May 15 '18 at 13:00
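For instance, stripping a whole column at once with the vectorized .str accessor (a minimal sketch on toy data):

```python
import pandas as pd

# Series.str.strip removes any of the listed characters from both ends
# of every value in the column in one vectorized call.
s = pd.Series(['["littleblackcat"', '["Kruse"'])
stripped = s.str.strip('[" ')
print(stripped.tolist())  # → ['littleblackcat', 'Kruse']
```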