
I have a text file with many rows of 6 columns each, but there is a \n after every 4th column as well as after every 6th column, something like:

Row 1 ---> 1 2 3 4\n 5 6\n

Row 2 ---> 7 8 9 10\n 11 12\n

I am using this command to create a DataFrame from the file:

df = pd.read_csv('info.txt', header=None, delimiter=r"\s+", names = cols, lineterminator='\n')

However, read_csv parses the data above as 4 rows, even though I explicitly pass the 6 column names via the names argument:

   col1 col2 col3 col4 col5 col6
0   1    2   3    4    NaN  NaN
1   5    6   NaN  NaN  NaN  NaN
2   7    8   9    10   NaN  NaN
3   11   12  NaN  NaN  NaN  NaN

How can I read the data as:

   col1 col2 col3 col4 col5 col6
0   1    2   3    4    5    6
1   7    8   9    10   11   12
  • What is your line terminator in the file? I mean the symbol at the end of `1 2 3 4\n 5 6\n` ? Do you have a windows/mac line-ending (`\r`, `\r\n`)? – Alexander Volkovsky Jun 04 '21 at 12:40
  • I ran open('info.txt','r+b').read() on the text file and I can see the numerical data and the \n characters in this pattern, e.g.: 61 4 2 242\n 392 4\n , so the line terminator should be \n, but it appears twice per row, which creates the problem. There is no other distinguishing symbol after the second \n, and the next row's values start after the second \n in the same pattern. – Himanshu Singh Jun 04 '21 at 14:07
  • You probably have non-Unix line endings (not `\n`). Otherwise you would get `61 4 2 242` and `392 4` as separate lines. You can try to find your line endings using https://stackoverflow.com/questions/3569997/how-to-find-out-line-endings-in-a-text-file – Alexander Volkovsky Jun 04 '21 at 14:13
  • I ran `file info.txt` and it gives the response `info.txt: ASCII text`, and when I view the file, it shows `61 4 2 242` and `392 4` as separate lines. So it looks like \n is the only line separator, but it is not aligned with the data. – Himanshu Singh Jun 04 '21 at 14:23
  • Are you using macOS? It sounds impossible to me to have a single line `1 2 3 4\n 5 6\n` while your editor and open() both show it as separate lines. You can inspect the newlines in a binary viewer, for example `cat info.txt | od -c | less`. I believe that you have macOS line endings (\r), and you can try `pd.read_csv(..., lineterminator='\r')` – Alexander Volkovsky Jun 04 '21 at 16:01
  • Yes, I am using macOS, and after running `cat info.txt | od -c | less` I still do not see any line delimiters other than \n. Thanks @AlexanderVolkovsky for the tips on checking for line terminators! – Himanshu Singh Jun 04 '21 at 16:44
  • So you need to read two lines as a single line. See my answer below. – Alexander Volkovsky Jun 04 '21 at 17:07

2 Answers


Taking inspiration from the answer by @gold_cy, I was able to solve the problem by extending the last element of the list on every alternate line, instead of appending a new row to the list:

def strip_newlines(fp):
    # Merge each pair of physical lines into one logical row of fields.
    rows = []
    with open(fp, "r") as fin:
        # iterate over the file object directly instead of readlines()
        for line_no, line in enumerate(fin):
            fields = line.split()
            if line_no % 2 == 1:
                # odd lines hold columns 5-6: extend the previous row
                rows[-1].extend(fields)
            else:
                # even lines hold columns 1-4: start a new row
                rows.append(fields)
    return rows

However, this may not be suitable for large files, since the entire list is built in memory.
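
One way to soften the memory concern is to process the file in chunks rather than building a single large list. The following is only a sketch under assumptions: the names `iter_merged_rows`, `iter_frames` and the `chunk_rows` parameter are made up for illustration, and it assumes every logical row spans exactly two physical lines and contains only numeric values.

```python
import itertools

import pandas as pd


def iter_merged_rows(path):
    """Yield one list of string fields per logical row (two physical lines)."""
    with open(path, "r") as fin:
        # Passing the same file iterator twice makes zip_longest consume
        # the file two lines at a time; a missing last line becomes "".
        for first, second in itertools.zip_longest(fin, fin, fillvalue=""):
            yield first.split() + second.split()


def iter_frames(path, cols, chunk_rows=100_000):
    """Yield DataFrames holding at most chunk_rows logical rows each."""
    rows = iter_merged_rows(path)
    while True:
        chunk = list(itertools.islice(rows, chunk_rows))
        if not chunk:
            break
        # Fields are read as strings, so convert each column to numbers.
        yield pd.DataFrame(chunk, columns=cols).apply(pd.to_numeric)
```

Each yielded frame can be aggregated or written out before the next one is read, so only one chunk of rows needs to be held at a time. If the whole result fits in memory, `pd.concat(iter_frames(path, cols), ignore_index=True)` should give a single DataFrame.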

0

You can create a file-like object with custom reading logic. A file-like object must provide __iter__ and read methods.

Test data: echo -en '1 2 3 4\n 5 6\n 7 8 9 10\n 11 12\n' > info.txt

class MultiLineReader:
    def __init__(self, filename):
        self.filename = filename
        self.fd = None

    # use as context manager in order to open and close file correctly
    def __enter__(self):
        self.fd = open(self.filename, 'r')
        return self

    def __exit__(self, type, value, traceback):
        self.fd.close()

    # file-like object must have this method
    def __iter__(self):
        while True:
            line = self.readline()
            if not line:
                break
            yield line

    # file-like object must have this method
    # just read a line 
    def read(self, size=-1):
        return self.readline()

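    # read two physical lines and join them into one logical row
    def readline(self):
        first = self.fd.readline()
        if not first:
            return ''  # end of file
        # join with a space so the fields stay separated even if the
        # continuation line has no leading whitespace
        return first.strip() + ' ' + self.fd.readline()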

# example usage
import pandas as pd

with MultiLineReader("info.txt") as f:
    df = pd.read_csv(f, sep=r'\s+', header=None)
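
If the file comfortably fits in memory, a lighter variant of the same idea is to merge the line pairs up front and hand pandas an in-memory buffer instead of a custom class. This is a minimal sketch, not a drop-in replacement: the helper name `read_paired_lines` is hypothetical, and it assumes every row is split across exactly two physical lines.

```python
import io

import pandas as pd


def read_paired_lines(path, cols):
    """Join every two physical lines into one row, then parse with pandas."""
    with open(path, "r") as fin:
        lines = [line.strip() for line in fin]
    # Pair line 0 with line 1, line 2 with line 3, and so on.
    # zip() silently drops a trailing unpaired line, if there is one.
    merged = "\n".join(
        " ".join(pair) for pair in zip(lines[0::2], lines[1::2])
    )
    # io.StringIO is a standard file-like object that read_csv accepts.
    return pd.read_csv(io.StringIO(merged), sep=r"\s+", header=None, names=cols)
```

Unlike the class above, this reads the whole file at once, so it trades memory for simplicity.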

Alexander Volkovsky