
I've read the question "Load data from txt with pandas", but my data format is a little different. Here is an example of the data:

product/productId: B003AI2VGA
review/userId: A141HP4LYPWMSR
review/profileName: Brian E. Erland "Rainbow Sphinx"
review/helpfulness: 7/7
review/score: 3.0
review/time: 1182729600
review/summary: "There Is So Much Darkness Now ~ Come For The Miracle"
review/text: Synopsis: On the daily trek from Juarez, Mexico to ... 

product/productId: B003AI2VGA
review/userId: A328S9RN3U5M68
review/profileName: Grady Harp
review/helpfulness: 4/4
review/score: 3.0
review/time: 1181952000
review/summary: Worthwhile and Important Story Hampered by Poor Script and Production
review/text: THE VIRGIN OF JUAREZ is based on true events...

.
.

I intend to do sentiment analysis, so I only want the text and score rows from each section. Does anybody know how to do this using pandas? Or do I need to read the file and parse each line myself to extract the review and rating?

Coding_Rabbit

3 Answers


This is one way:

import pandas as pd
from io import StringIO

mystr = StringIO("""product/productId: B003AI2VGA
review/userId: A141HP4LYPWMSR
review/profileName: Brian E. Erland "Rainbow Sphinx"
review/helpfulness: 7/7
review/score: 3.0
review/time: 1182729600
review/summary: "There Is So Much Darkness Now ~ Come For The Miracle"
review/text: Synopsis: On the daily trek from Juarez, Mexico to ... 

product/productId: B003AI2VGA
review/userId: A328S9RN3U5M68
review/profileName: Grady Harp
review/helpfulness: 4/4
review/score: 3.0
review/time: 1181952000
review/summary: Worthwhile and Important Story Hampered by Poor Script and Production
review/text: THE VIRGIN OF JUAREZ is based on true events...""")

# replace mystr with 'file.txt'
# on_bad_lines='skip' needs pandas >= 1.3; older versions use error_bad_lines=False
df = pd.read_csv(mystr, header=None, sep='|', on_bad_lines='skip')

df = pd.DataFrame(df[0].str.split(':', n=1).values.tolist())
df = df.loc[df[0].isin({'review/text', 'review/score'})]

Result:

               0                                                  1
4   review/score                                                3.0
7    review/text   Synopsis: On the daily trek from Juarez, Mexi...
12  review/score                                                3.0
15   review/text    THE VIRGIN OF JUAREZ is based on true events...
jpp
  • Thanks for your answer. I tried it, but it gives me an error like this: File "pandas/_libs/parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory File "pandas/_libs/parsers.pyx", line 978, in pandas._libs.parsers.TextReader._read_rows File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 59372, saw 14 – Coding_Rabbit Mar 22 '18 at 17:34
  • You can try `error_bad_lines=False` argument, as in updated answer above, but this is at your own risk (bad data will get skipped over). – jpp Mar 22 '18 at 17:47
  • Thanks. One more question: I want to match each review text with its score, but right now they are on separate lines. Do you have any idea how to combine them into one line, so that each line represents a (review text, score) pair? – Coding_Rabbit Mar 22 '18 at 19:07
  • @Coding_Rabbit, It's definitely possible - I suggest you first search on SO, or ask as a separate question. – jpp Mar 22 '18 at 21:23
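On the follow-up about pairing each review text with its score: here is a minimal sketch of one possible approach, not necessarily what the answerer had in mind. It assumes every record contributes exactly one `review/score` row and one `review/text` row, in the same order. That assumption can break if bad lines were skipped while reading. The small frame below is a toy stand-in for the filtered result above, and the names `kv` and `pairs` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the filtered two-column frame (0 = key, 1 = value).
df = pd.DataFrame(
    [
        ["review/score", " 3.0"],
        ["review/text", " Synopsis: On the daily trek ..."],
        ["review/score", " 3.0"],
        ["review/text", " THE VIRGIN OF JUAREZ ..."],
    ],
    index=[4, 7, 12, 15],
)

kv = df.rename(columns={0: "key", 1: "value"})

# Assumption: the n-th score and the n-th text belong to the same review,
# so a per-key running count can serve as a record id.
kv["record"] = kv.groupby("key").cumcount()

# One row per record, one column per key.
pairs = kv.pivot(index="record", columns="key", values="value")
pairs = pairs.rename(columns={"review/score": "score", "review/text": "text"})

# Clean up the leading spaces left over from splitting on ':'.
pairs["score"] = pairs["score"].str.strip().astype(float)
pairs["text"] = pairs["text"].str.strip()
```

If the counts of score and text rows differ, the pivot will leave missing values in some rows, which is a cheap way to notice that the alignment assumption failed.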

As far as I know, pandas cannot read the file in its current form.

I would suggest writing a Python program that reads your file and outputs a CSV file, say sentiment.csv, like so:

Product Id,Reviewer ID,Profile Name,Helpfulness,Score,Time,Summary,text
B003AI2VGA,A141HP4LYPWMSR,Brian E. Erland "Rainbow Sphinx",7/7,3.0,1182729600,"There Is So Much Darkness Now ~ Come For The Miracle", Synopsis: On the daily trek from Juarez, Mexico to...
B003AI2VGA,A328S9RN3U5M68,Grady Harp,4/4,3.0,1181952000,Worthwhile and Important Story Hampered by Poor Script and Production,THE VIRGIN OF JUAREZ is based on true events...

Then simply use: df = pd.read_csv('sentiment.csv')
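A rough sketch of what such a conversion program might look like, using only the standard library. It streams the file line by line, so the whole input never has to fit in memory. The function names `records` and `convert` are hypothetical, and the sketch assumes that records are separated by blank lines, that each field sits on a single line of the form `prefix/name: value`, and that the file is UTF-8. None of this is confirmed for the real data.

```python
import csv


def records(lines):
    """Yield one dict per blank-line-separated block of 'key: value' lines."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            # 'review/score' -> 'score', 'product/productId' -> 'productId'
            record[key.rsplit("/", 1)[-1]] = value.strip()
    if record:
        yield record  # last record when the file has no trailing blank line


def convert(src_path, dst_path, columns=("score", "text")):
    """Write only the requested columns of every record to a CSV file."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        # extrasaction="ignore" drops the fields that were not requested.
        writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records(src))
```

Calling something like `convert('your_text_file.txt', 'sentiment.csv')` would then produce a file that `pd.read_csv` can load directly. Letting the `csv` module do the writing also takes care of quoting fields that contain commas or double quotes.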

sanrio
  • The thing is, the file is quite big, almost 10G. Won't it be quite slow to convert it to CSV and then read it afterwards? – Coding_Rabbit Mar 22 '18 at 17:32
  • The converted file will be a lot smaller than 10G. The problem with the original file is that it repeats the metadata for every row. In the converted file, the metadata is the first line and the rest is data. As for converting it to CSV in Python, if performance becomes an issue, you could split the original file into smaller files, handle those separately, and merge the resulting files at the end. – sanrio Mar 23 '18 at 02:30

I think the answer from @sanrio is likely the most straightforward, but here is an option for doing the string manipulation in pandas:

import numpy as np
import pandas as pd

with open('your_text_file.txt') as f:
    text_lines = f.readlines()

# create pandas Series object where each value is a text line from your file
s = pd.Series(text_lines)

# remove the new-lines
s = s.str.strip()

# extract some columns using regex and represent in a dataframe
df = s.str.split(r'\s?(.*)/([^:]*):(.*)', expand=True)

# remove irrelevant columns
df = df.replace('', np.nan).dropna(how='all', axis=1)

def gb_organize(df_):
    """
    Organize a sub-dataframe from group-by operation.
    """
    df_ = df_.dropna()
    return pd.DataFrame(df_[3].values, index=df_[2].values).T

# pass a Series object to .groupby to iterate over consecutive non-null rows
df_result = df.groupby(df.isna().all(axis=1).cumsum(), group_keys=False).apply(gb_organize)

df_result = df_result.set_index(['productId', 'userId'])

# then you can access the records you want with the following:
df_result[['score', 'text']]


jeschwar