0
I have several .dat files containing information about hotel reviews, as below:
/*
<Author> simmotours
<Content> review......goes here
<Date>Nov 18, 2008
<No. Reader>-1
<No. Helpful>-1
<Overall>4
<Value>4
<Rooms>3
<Location>4
<Cleanliness>4
<Check in / front desk>4
<Service>4
<Business service>-1

*/

I want to classify the reviews into two groups, pos and neg. That is, I want two folders, pos and neg, each containing review files, with reviews rated above 3 classified as positive and reviews rated below 3 classified as negative.

How can I quickly and efficiently automate this process?
user3801185

2 Answers

0

You could write a Python script to read the overall score. Loop over the lines of each file using readline(), find the "Overall" score with some string parsing, then move the file into the right directory. These are all simple things to do in Python: break the task down into steps and search for answers to each step.
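The steps above could be sketched roughly as follows. This is only a minimal sketch under some assumptions that are not stated in the question: each .dat file holds a single review, a score of -1 means "missing" (as the other -1 fields in the sample suggest), and a score of exactly 3 is left unsorted. The function names `overall_score`, `label_for` and `sort_reviews` are made up for illustration. It reads each file in one go instead of calling readline(), and it copies rather than moves so the originals stay in place.

```python
import re
import shutil
from pathlib import Path

# Matches a line such as "<Overall>4" and captures the number.
OVERALL_RE = re.compile(r"^<Overall>\s*(-?\d+)")


def overall_score(text):
    """Return the integer after the <Overall> tag, or None if it is absent."""
    for line in text.splitlines():
        match = OVERALL_RE.match(line.strip())
        if match:
            return int(match.group(1))
    return None


def label_for(score):
    """Map a score to 'pos' (above 3) or 'neg' (below 3).

    Returns None for a missing score, a negative score (assumed to mean
    "not given") or a score of exactly 3.
    """
    if score is None or score < 0:
        return None
    if score > 3:
        return "pos"
    if score < 3:
        return "neg"
    return None


def sort_reviews(src_dir, dest_dir):
    """Copy every *.dat file in src_dir into dest_dir/pos or dest_dir/neg."""
    for path in Path(src_dir).glob("*.dat"):
        label = label_for(overall_score(path.read_text(errors="replace")))
        if label is None:
            continue  # unrated or neutral review: leave it where it is
        target = Path(dest_dir) / label
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target / path.name)
```

Calling `sort_reviews("reviews", "sorted")` (both folder names are placeholders) would fill `sorted/pos` and `sorted/neg`. Swapping `shutil.copy2` for `shutil.move` would move the files instead of copying them.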

Community
blsmit5728
  • I was thinking of converting the above format to XML by adding closing tags such as </Author>, and then parsing it with an XML parser. But I am stuck on how to append them, i.e. search for <Author> * and replace it with <Author> * </Author> – user3801185 Jul 04 '14 at 14:04
  • @user3801185 A simple search/replace of `^<(\w+)>(.*)$` with `<\1>\2</\1>` would do it, assuming the lines are as in the example and have no embedded `<` or `>`. But you would first need to change the tags that contain non-alphanumeric characters into valid tag names. – AdrianHHH Jul 07 '14 at 15:19
0

Notepad++ can do replacements with regular expressions and also lets you define macros. Use them to convert the file to an XML file. Check out the help file.

Then you can read it with any scripting language and do what you want.
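If you would rather do the conversion in a script than in an editor, something like the following might work. It is a sketch only: it assumes one `<Tag>value` pair per line as in the sample, wraps everything in a made-up `<review>` root element, and replaces characters that are not valid in XML names with underscores (so `No. Reader` becomes `No_Reader`). The helper names `tag_name`, `to_xml` and `parse_review` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# One "<Tag>value" pair per line; the tag may contain spaces, dots or slashes.
LINE_RE = re.compile(r"^<([^<>]+)>(.*)$")


def tag_name(raw):
    """Turn a label such as 'No. Reader' into an XML-safe name like 'No_Reader'."""
    name = re.sub(r"\W+", "_", raw.strip()).strip("_")
    if not name or name[0].isdigit():
        name = "_" + name  # XML names may not be empty or start with a digit
    return name


def to_xml(text):
    """Convert the review text into an XML string with a <review> root."""
    parts = ["<review>"]
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything that is not a tag line
        name = tag_name(match.group(1))
        value = escape(match.group(2).strip())  # protect &, < and > in the text
        parts.append(f"<{name}>{value}</{name}>")
    parts.append("</review>")
    return "\n".join(parts)


def parse_review(text):
    """Parse the converted text and return the root Element."""
    return ET.fromstring(to_xml(text))
```

With that in place, `parse_review(text).findtext("Overall")` should give the overall score as a string, which you can convert with `int()` before comparing it to 3.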

Alternatively you could change the file to a form where you can load it into Excel and do the analysis there.
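One form Excel can open directly is CSV. As a rough sketch, assuming one review per .dat file and one `<Tag>value` pair per line, the files could be flattened into a single table with one row per file. The names `parse_fields` and `write_csv`, and the extra `file` column, are made up for this example.

```python
import csv
import re
from pathlib import Path

# One "<Tag>value" pair per line, as in the sample review.
FIELD_RE = re.compile(r"^<([^<>]+)>(.*)$")


def parse_fields(text):
    """Return a dict mapping each tag (e.g. 'Overall') to its value as a string."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


def write_csv(src_dir, csv_path):
    """Write one CSV row per *.dat file in src_dir; return the number of rows."""
    rows = []
    for path in sorted(Path(src_dir).glob("*.dat")):
        row = {"file": path.name}
        row.update(parse_fields(path.read_text(errors="replace")))
        rows.append(row)

    # Collect column names in first-seen order so the header covers every tag.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

After `write_csv("reviews", "reviews.csv")` (placeholder names), the file can be opened in Excel and sorted or filtered on the Overall column.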

z--