I am currently running a script on a Linux system. The script reads a CSV of around 6000 lines into a DataFrame. The job of the script is to turn a DataFrame such as:
name children
Bob [Jeremy, Nancy, Laura]
Jennifer [Kevin, Aaron]
to:
name children childName
Bob [Jeremy, Nancy, Laura] Jeremy
Bob [Jeremy, Nancy, Laura] Nancy
Bob [Jeremy, Nancy, Laura] Laura
Jennifer [Kevin, Aaron] Kevin
Jennifer [Kevin, Aaron] Aaron
And write the result to ANOTHER FILE (the original CSV must remain unchanged).
Basically, add a new column and create a row for each item in the list. Note that I am actually dealing with a DataFrame of 7 columns, but for demonstration purposes I am using a smaller example. The columns in my actual CSV are all strings except for two that are lists.
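For reference, here is a minimal sketch of the transformation on the small example above, assuming a pandas version that has DataFrame.explode (0.25+) and that the list column already holds real Python lists:

```python
import pandas as pd

# Minimal sketch of the desired unfold, assuming pandas 0.25+ (explode)
# and that the list column already contains real Python lists.
df = pd.DataFrame({
    "name": ["Bob", "Jennifer"],
    "children": [["Jeremy", "Nancy", "Laura"], ["Kevin", "Aaron"]],
})

# Copy the list column into childName, then explode only the copy, so
# the original "children" column is kept unchanged on every output row.
unfolded = df.assign(childName=df["children"]).explode("childName")
print(unfolded)
```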
This is my code:
import ast
import os

import pandas as pd

cwd = os.path.abspath(__file__ + "/..")
data = pd.read_csv(cwd + "/folded_data.csv", sep='\t', encoding="latin1")
output_path = cwd + "/unfolded_data.csv"
out_header = ["name", "children", "childName"]
count = len(data)

for idx, e in data.iterrows():
    print("Row ", idx, " out of ", count)
    entry = e.values.tolist()
    c_lst = ast.literal_eval(entry[1])
    for c in c_lst:
        n_entry = entry + [c]
        if os.path.exists(output_path):
            output = pd.read_csv(output_path, sep='\t', encoding="latin1")
        else:
            output = pd.DataFrame(columns=out_header)
        output.loc[len(output)] = n_entry
        output.to_csv(output_path, sep='\t', index=False)
But I am getting the following error:
Traceback (most recent call last):
  File "fileUnfold.py", line 31, in <module>
    output.to_csv(output_path, sep='\t', index=False)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 3020, in to_csv
    formatter.save()
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/formats/csvs.py", line 172, in save
    self._save()
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/formats/csvs.py", line 288, in _save
    self._save_chunk(start_i, end_i)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/formats/csvs.py", line 315, in _save_chunk
    self.cols, self.writer)
  File "pandas/_libs/writers.pyx", line 75, in pandas._libs.writers.write_csv_rows
MemoryError
Is there another way to do what I want to do without getting this error?
EDIT: csv file if you want to have a look https://media.githubusercontent.com/media/lucas0/Annotator/master/annotator/data/folded_snopes.csv
EDIT2: I am currently using
with open(output_path, 'w+') as f:
    output.to_csv(f, index=False, header=True, sep='\t')
And around the 98th row the program starts slowing down considerably. I am pretty sure this is because I am reading the file over and over again as it gets larger. How can I just append a row to the file without reading it?
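One way to append without ever re-reading the output is to stream each unfolded row with the stdlib csv module, keeping a single writer open for the whole run. A minimal sketch on the small two-column example (io.StringIO stands in for the real output file):

```python
import ast
import csv
import io

# Input rows as they would come out of read_csv: the list column is a
# stringified Python list, so ast.literal_eval parses it back.
folded = [
    {"name": "Bob", "children": "['Jeremy', 'Nancy', 'Laura']"},
    {"name": "Jennifer", "children": "['Kevin', 'Aaron']"},
]

out = io.StringIO()  # stand-in for open(output_path, "w", newline="")
writer = csv.writer(out, delimiter="\t")
writer.writerow(["name", "children", "childName"])  # header written once

for row in folded:
    children = ast.literal_eval(row["children"])  # parse stringified list
    for child in children:
        # One output row per child, written immediately; nothing is
        # accumulated in memory and the file is never read back.
        writer.writerow([row["name"], row["children"], child])

print(out.getvalue())
```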
EDIT3: Here is the actual code that I am using to deal with the data linked in my first edit. This might make it easier to answer.
import ast
import os

import pandas as pd

cwd = os.path.abspath(__file__ + "/..")
snopes = pd.read_csv(cwd + "/folded_snopes.csv", sep='\t', encoding="latin1")
output_path = cwd + "/samples.csv"
out_header = ["page", "claim", "verdict", "tags", "date", "author", "source_list", "source_url"]
count = len(snopes)

for idx, e in snopes.iterrows():
    print("Row ", idx, " out of ", count)
    entry = e.values.tolist()
    src_lst = ast.literal_eval(entry[6])
    for src in src_lst:
        n_entry = entry + [src]
        if os.path.exists(output_path):
            output = pd.read_csv(output_path, sep='\t', encoding="latin1")
        else:
            output = pd.DataFrame(columns=out_header)
        output.loc[len(output)] = n_entry
        with open(output_path, 'w+') as f:
            output.to_csv(f, index=False, header=True, sep='\t')
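Instead of re-reading the output on every iteration, to_csv can also append with mode='a', writing the header only the first time. A minimal sketch under a throwaway temp path ("demo_samples.csv" and the two batches are illustrative stand-ins for the real data):

```python
import os
import tempfile

import pandas as pd

# Sketch: append each batch of unfolded rows with to_csv(mode="a"),
# writing the header only when the file does not yet exist.
path = os.path.join(tempfile.gettempdir(), "demo_samples.csv")
if os.path.exists(path):
    os.remove(path)  # start clean for the demo

batches = [
    pd.DataFrame({"name": ["Bob"] * 3,
                  "childName": ["Jeremy", "Nancy", "Laura"]}),
    pd.DataFrame({"name": ["Jennifer"] * 2,
                  "childName": ["Kevin", "Aaron"]}),
]

for batch in batches:
    # header is True only on the first write, before the file exists.
    batch.to_csv(path, sep="\t", index=False, mode="a",
                 header=not os.path.exists(path))

result = pd.read_csv(path, sep="\t")
os.remove(path)  # clean up the demo file
print(result)
```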