
I'm in need of some knowledge on how to fix an error I have made while collecting data. The collected data has the following structure:

["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]

I normally wouldn't have added "[" or "]" to .txt file when writing the data to it, line per line. However, the mistake was made and thus when loading the file it will separate it the following way:

(screenshot of the mis-parsed pandas dataframe omitted)

Is there a way to load the data properly to pandas?


5 Answers


On the snippet that I can cut and paste from the question (which I named test.txt), I could successfully read a dataframe via

  1. Purging the square brackets (with sed on a Linux command line, but this can also be done with a text editor, or in Python if need be)

    sed -i 's/^\[//g' test.txt # remove left square brackets assuming they are at the beginning of the line
    sed -i 's/\]$//g' test.txt # remove right square brackets assuming they are at the end of the line
    
  2. Loading the dataframe (in a python console)

    import pandas as pd
    pd.read_csv("test.txt", skipinitialspace = True, quotechar='"')
    

(not sure that this will work for the entirety of your file though).
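The sed purge in step 1 can also be sketched in plain Python (a minimal sketch, assuming Python 3.9+ for str.removeprefix/str.removesuffix; here the sample lines from the question are first written to test.txt so the snippet is self-contained):

```python
# Recreate the sample file from the question, then strip one leading "["
# and one trailing "]" from every line -- the pure-Python equivalent of
# the two sed commands above.
sample = '''["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]'''

with open("test.txt", "w") as f:
    f.write(sample)

with open("test.txt") as f:
    cleaned = [line.removeprefix("[").removesuffix("]")
               for line in f.read().splitlines()]

with open("test.txt", "w") as f:
    f.write("\n".join(cleaned) + "\n")
```

After this, the read_csv call in step 2 applies unchanged.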

Davide Fiocco

Consider the code below, which reads the text in myfile.txt; the file looks like this:

["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words ,it's basically creating a mini tornado."]

The code below removes [ and ] from the text, then splits every line on , (skipping the first line, which holds the headers). Some messages contain commas themselves, which would otherwise spill into an extra column (NaN elsewhere), so everything after the first comma is joined back into a single string, as intended. Code:

import pandas as pd

with open('myfile.txt', 'r') as my_file:
    text = my_file.read()
    text = text.replace("[", "")
    text = text.replace("]", "")

# Skip the header line and drop any empty trailing line.
lines = [i for i in text.split('\n')[1:] if i]

df = pd.DataFrame({
    'Author': [i.split(',')[0] for i in lines],
    'Message': [''.join(i.split(',')[1:]) for i in lines]
}).applymap(lambda x: x.replace('"', ''))

Output:

    Author                             Message
0   littleblackcat    There's a lot of redditors here that live in the area maybe/hopefully someone saw something. 
1   Kruse             In other words it's basically creating a mini tornado.
harvpan
  • Thank you for your response Harv, I have considered this option however my .txt file contains between 3-6 million lines. Hence it is not a feasible solution. – Sahra Levre May 14 '18 at 16:03

A pure pandas option is to change the separator from , to ", " in order to have only 2 columns, and then strip the unwanted characters, which to my understanding are [, ], " and space:

import pandas as pd
import io
string = '''
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''

df = pd.read_csv(io.StringIO(string),sep='\", \"', engine='python').apply(lambda x: x.str.strip('[\"] '))
# the \" instead of simply " is to make sure python does not interpret it as an end of string character
df.columns = [df.columns[0][2:],df.columns[1][:-2]]

print(df)
# Output (note the space before "There's" is also gone)
#            Author                                            Message
# 0  littleblackcat  There's a lot of redditors here that live in t...
# 1           Kruse  In other words, it's basically creating a mini...
OriolAbril

Here are a few more options to add to the mix:

  1. You could parse the lines yourself using ast.literal_eval, and then load them into a pd.DataFrame directly using an iterator over the lines:

    import pandas as pd
    import ast
    with open('data', 'r') as f:
        lines = (ast.literal_eval(line) for line in f)
        header = next(lines)
        df = pd.DataFrame(lines, columns=header)
        print(df)
    

    Note, however, that calling ast.literal_eval once for each line may not be very fast, especially if your data file has a lot of lines. If the data file is not too big, though, this may be an acceptable, simple solution.

  2. Another option is to wrap an arbitrary iterator (which yields bytes) in an IterStream. This very general tool (thanks to Mechanical snail) allows you to manipulate the contents of any file and then re-package it into a file-like object. Thus, you can fix the contents of the file, and yet still pass it to any function which expects a file-like object, such as pd.read_csv. (Note: I've answered a similar question using the same tool, here.)

    import io
    import pandas as pd
    
    def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
        """
        http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
        Lets you use an iterable (e.g. a generator) that yields bytestrings as a
        read-only input stream.
    
        The stream implements Python 3's newer I/O API (available in Python 2's io
        module).
    
        For efficiency, the stream is buffered.
        """
        class IterStream(io.RawIOBase):
            def __init__(self):
                self.leftover = None
            def readable(self):
                return True
            def readinto(self, b):
                try:
                    l = len(b)  # We're supposed to return at most this much
                    chunk = self.leftover or next(iterable)
                    output, self.leftover = chunk[:l], chunk[l:]
                    b[:len(output)] = output
                    return len(output)
                except StopIteration:
                    return 0    # indicate EOF
        return io.BufferedReader(IterStream(), buffer_size=buffer_size)
    
    def clean(f):
        for line in f:
            yield line.strip()[1:-1]+b'\n'
    
    with open('data', 'rb') as f:
        # https://stackoverflow.com/a/50334183/190597 (Davide Fiocco)
        df = pd.read_csv(iterstream(clean(f)), skipinitialspace=True, quotechar='"')
        print(df)
    
unutbu

For now the following solution was found:

sep = '[|"|]'

Using a multi-character separator allowed the brackets to be stored in separate columns of the pandas dataframe, which were then dropped. This avoids having to strip the words line by line.
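As a sketch of what that looks like end to end (note that sep='[|"|]' is read by pandas as a regex character class, so it actually splits on every double quote, which is what leaves the stray brackets isolated in their own columns; the sample data from the question is used here):

```python
import io

import pandas as pd

# Sample with the stray brackets, as in the question.
raw = '''["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]'''

# sep='[|"|]' is a regex character class matching '"' (and '|'), so the
# python engine splits each line at the quotes; the [ and ] characters
# end up isolated in throwaway columns.
df = pd.read_csv(io.StringIO(raw), sep='[|"|]', engine='python')

# Drop the bracket/comma debris columns and keep the real data.
df = df[['Author', 'Message']]
print(df)
```

Messages that themselves contain a double quote would still break this, since regex separators ignore quoting.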

  • pandas has some string methods implemented for Series objects. Thus, words can be stripped one whole column at a time with the apply method – OriolAbril May 15 '18 at 13:00
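For instance, stripping a whole column at once with the vectorized .str accessor (a minimal sketch on toy data):

```python
import pandas as pd

# Series.str.strip removes any of the listed characters from both ends
# of every value in the column in one vectorized call.
s = pd.Series(['["littleblackcat"', '["Kruse"'])
stripped = s.str.strip('[" ')
print(stripped.tolist())  # → ['littleblackcat', 'Kruse']
```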