I have a folder named x_list with subfolders named [y1, y2, ..., y10]. These subfolders contain text files. I need to read these text files into Python together with their corresponding subfolder name, which comes from x_list.

I have the following code, which works. The only issue is that the text files lose their punctuation. I believe the error is in the append call.

import os
import pandas as pd

df = pd.DataFrame()
x_list = os.listdir(x_path)  # subfolder names, used as classes
for i in range(0, len(x_list)):
    x_path2 = x_path + "/" + x_list[i]
    files = os.listdir(x_path2)
    # Read all the documents from the subfolder and fill the dataframe
    for j in range(0, len(files)):
        p = x_path2 + "/" + files[j]
        f = open(p, "r")
        df = df.append({'text': f.read(), 'class': x_list[i]}, ignore_index=True)
        f.close()

The text contains dates, but in the output the dates are presented like 01012017 instead of 01-01-2017. Dots, commas and currency symbols are also lost.

How do I solve this issue so that I don't lose the punctuation?

The output should look like:

text                         class
Welcome blabla 20-09-2017    y1
Goodbye blabla 23-09-2017    y1
lorum es ti date 09-09-2017  y2
Jelmer
  • You might be reading with a different encoding than the one the file was written in. On Linux you can check the file encoding with `file -i file_name`; on OSX, use `file -I file_name`. – Nogoseke Oct 05 '17 at 10:55
  • For all the ugly handling of directories, subdirectories and filenames, I'd use os.walk (https://docs.python.org/3.6/library/os.html). But that is not your problem, so I'd remove it from the title, the description and everything else. Your problem can be summarised using the lines that involve df; the rest is just noise. – Jblasco Oct 05 '17 at 11:01
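
The two comments above (explicit encoding, os.walk) can be combined into one short sketch. The function name `read_labelled_texts` and the default `encoding="utf-8"` are assumptions for illustration; the real files may use a different encoding, which the `file` command mentioned above would reveal.

```python
import os


def read_labelled_texts(root, encoding="utf-8"):
    """Return one {'text', 'class'} dict per file found below root.

    The class is the name of the first-level subfolder under root.
    The encoding default is an assumption; pass the real one if it differs.
    """
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel == os.curdir:
            continue  # files directly in root have no subfolder, so no class
        label = rel.split(os.sep)[0]  # first path component below root
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name), encoding=encoding) as f:
                rows.append({"text": f.read(), "class": label})
    return rows
```

Reading with an explicit encoding does not alter punctuation; if characters still go missing after this step, the cause is likely somewhere later in the processing.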

2 Answers

I tried your code, and it gives this error:

NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\Users\user\Desktop\foo\asd.txt'

Here is my code:

import pandas as pd
import os

df = pd.DataFrame()
x_path = r"C:\Users\user\Desktop\foo" 
files = os.listdir(x_path)
for file in files:
    p = x_path + "/" + file
    with open(p) as f:
        df = df.append({'text':f.read(), 'class':file}, ignore_index =True) 

print(df)

With this, there is no punctuation problem. Output (I use Windows 10):

     class                         text
0  asd.txt    Welcome blabla 20-09-2017
1  qwe.txt    Goodbye blabla 23-09-2017
2  zxc.txt  lorum es ti date 09-09-2017

If you try this code and the punctuation is still gone, let me know.

Also, if you are not familiar with the with statement, use this:

...
f = open(p)
df = df.append({'text':f.read(), 'class':file}, ignore_index =True)
f.close()
...
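
A note for newer pandas versions: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so the loops above fail there. A minimal sketch of the usual replacement is to collect the rows in a plain list and build the DataFrame once. The helper name `rows_to_frame` and the sample rows are only illustrative.

```python
import pandas as pd


def rows_to_frame(rows):
    """Build the DataFrame in one step from a list of {'text', 'class'} dicts."""
    return pd.DataFrame(rows, columns=["text", "class"])


# Illustrative rows; in the real loop each dict would come from f.read()
rows = [
    {"text": "Welcome blabla 20-09-2017", "class": "y1"},
    {"text": "Goodbye blabla 23-09-2017", "class": "y1"},
    {"text": "lorum es ti date 09-09-2017", "class": "y2"},
]
df = rows_to_frame(rows)
print(df)
```

Building the frame once also avoids copying the whole DataFrame on every iteration, which is what the repeated append did.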
Alperen

Thanks for your input, but I have solved it. The issue is a bit simpler than I thought. I have a method that removes the punctuation. I made a "copy" of my df with dfraw = df instead of dfraw = df.copy(), so dfraw was only a second name for the same DataFrame and was modified along with df.
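
The difference between the two assignments can be reproduced with a small sketch. The regex-based cleaning step below is a hypothetical stand-in for the punctuation-removing method, which is not shown in this thread.

```python
import pandas as pd

df = pd.DataFrame({"text": ["Welcome blabla 20-09-2017"], "class": ["y1"]})

alias = df          # plain assignment: a second name for the same object
backup = df.copy()  # independent copy of the data

# Hypothetical cleaning step that modifies df in place
df["text"] = df["text"].str.replace(r"[^\w\s]", "", regex=True)

print(alias is df)            # True
print(alias.loc[0, "text"])   # Welcome blabla 20092017
print(backup.loc[0, "text"])  # Welcome blabla 20-09-2017
```

Only the frame created with `.copy()` keeps the original punctuation.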

why should I make a copy of a data frame in pandas

Jelmer