Looking at your data, it seems someone has dumped the str version of a list into a file as-is, using Python 2.
One thing's for sure - you can't use a CSV reader for this data. You can't even use a JSON parser (which would've been the next best thing).
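To make that concrete, here is a small sketch that feeds one line to both parsers. The sample line is adapted from the data in this answer; the variable names are only for illustration.

```python
import csv
import json

# One line in the style of the file: a Python 2 tuple repr.
line = ("(22642441022L, '<a href=\"http://example.com\">Click</a>', "
        "'fox, dog, cat are examples http://example.com')")

# csv splits on every comma, including the ones inside the single-quoted
# string, so the three values come back as more than three fields.
fields = next(csv.reader([line]))
print(len(fields))

# json rejects the line outright: parentheses, single-quoted strings and
# the L suffix are not valid JSON.
try:
    json.loads(line)
    json_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    json_ok = False
print(json_ok)
```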
What you can do is use ast.literal_eval. With Python 2, this works out of the box.
import ast
data = []
with open('file.txt') as f:
    for line in f:
        try:
            data.append(ast.literal_eval(line))  # parse one tuple per line
        except (SyntaxError, ValueError):
            pass  # skip lines that are not valid Python literals
data should look something like this -
[(22642441022L,
'<a href="http://example.com">Click</a>',
'fox, dog, cat are examples http://example.com'),
(1153634043,
'<a href="http://example.com">Click</a>',
"I learned so much from my mistakes, I think I'm gonna make some more")]
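The loop above catches both SyntaxError and ValueError because ast.literal_eval can fail in two ways: the text may not parse at all, or it may parse but contain something other than a plain literal. A minimal sketch of both cases follows; try_parse and the sample strings are made up for illustration.

```python
import ast

def try_parse(text):
    """Return (value, None) on success, or (None, exception name) on failure."""
    try:
        return ast.literal_eval(text), None
    except (SyntaxError, ValueError) as exc:
        return None, type(exc).__name__

samples = [
    "(1153634043, 'fox, dog, cat', \"I'm fine\")",  # valid tuple literal
    "(1153634043, 'unterminated",                   # truncated line: does not parse
    "some_name",                                    # parses, but is a name, not a literal
]
results = [try_parse(s) for s in samples]
for result in results:
    print(result)
```

Since only literals are accepted, a stray line in the file cannot execute code; it is just skipped.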
You can then pass data into a DataFrame as-is -
import pandas as pd

df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df
             A                                       B  \
0  22642441022  <a href="http://example.com">Click</a>
1   1153634043  <a href="http://example.com">Click</a>

                                                   C
0      fox, dog, cat are examples http://example.com
1  I learned so much from my mistakes, I think I'...
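If you want this to work with Python 3, you'll need to get rid of the long suffix L, which is a syntax error there. The unicode prefix u is accepted again from Python 3.3 onward, so stripping it only matters on 3.0 to 3.2, but doing so is harmless. You might be able to do both using re.sub from the re module.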
import ast
import re

data = []
with open('file.txt') as f:
    for line in f:
        try:
            i = re.sub(r'(\d+)L', r'\1', line)    # remove L suffix
            j = re.sub(r"(?<=,\s)u(?=')", '', i)  # remove u prefix
            data.append(ast.literal_eval(j))
        except (SyntaxError, ValueError):
            pass  # skip lines that still fail to parse
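Notice the added re.sub call with the pattern (\d+)L, which removes the L suffix at the end of a string of digits. Both substitutions run on the whole line, so they will also change matching text inside the quoted strings (a literal 10L in the text becomes 10, for example); check that this is acceptable for your data.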
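If you'd like to try the Python 3 variant without a file on disk, here is a self-contained sketch. It uses the same two regexes, reads from an in-memory io.StringIO, and counts skipped lines instead of dropping them silently. The function name parse_py2_lines and the sample lines are hypothetical, not part of your data.

```python
import ast
import io
import re

LONG_SUFFIX = re.compile(r'(\d+)L')             # 22642441022L -> 22642441022
UNICODE_PREFIX = re.compile(r"(?<=,\s)u(?=')")  # , u'text'    -> , 'text'

def parse_py2_lines(lines):
    """Parse lines holding Python 2 tuple reprs; return (rows, skipped count)."""
    rows, skipped = [], 0
    for line in lines:
        cleaned = UNICODE_PREFIX.sub('', LONG_SUFFIX.sub(r'\1', line))
        try:
            rows.append(ast.literal_eval(cleaned))
        except (SyntaxError, ValueError):
            skipped += 1  # not a valid literal even after cleaning
    return rows, skipped

# Made-up sample standing in for file.txt.
sample = io.StringIO(
    "(22642441022L, u'<a href=\"http://example.com\">Click</a>', u'fox, dog, cat')\n"
    "this line is not a tuple\n"
    "(1153634043, 'plain', \"I'm gonna make some more\")\n"
)
rows, skipped = parse_py2_lines(sample)
print(rows)
print(skipped)
```

Passing an open file object instead of the StringIO should work the same way, since both yield one line per iteration.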