Help with regex 1

Status
Not open for further replies.

MrCBofBCinTX

Technical User
Dec 24, 2003
164
US
I am having trouble figuring out the right regex for this.
I have:
Code:
if ($line =~ /s\.prop1="([\w\s,.-\/i"'+\(\)&%#=:]+)";/) {

I am trying to find lines that meet the requirements for a product description. Those lines may hold any alphanumeric and most punctuation characters.

The correct line is s.prop1="Stuff here";

"Stuff here" does often contain " and ;
I am getting an incorrect match if I include both ; and " in the character class, often getting too much stuff after the ending ";

I have no control over these web pages, so I just have to work with what the vendor has online.

An example line is as follows:
Code:
s.prop1="Economy Tile Installation Kit; One V-Notch Trowel, One Square Notch Trowel, One Grout Squeegee";
 
Your pattern works if you:
1) Remove the special meaning of the "-" by back-slashing it or putting it at the end of your class-list.
2) Add the semicolon to your class-list.
3) Make the ending ";" optional by following it with a "?".

So try this (you can remove the "i" from the class-list):
if ($line =~ /^s\.prop1="[\w\s;,.\-\/i"'+()&%#=:]+";?/)
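
To illustrate point 1, here is a minimal sketch of why the unescaped "-" matters. The helper name class_matches is made up for this example, and the two classes are cut-down versions of the ones above. Inside a class, ".-\/" is read as a range from "." to "/", so the hyphen itself is never matched:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $char is matched by the class regex, else 0.
sub class_matches {
    my ($class_re, $char) = @_;
    return $char =~ $class_re ? 1 : 0;
}

# ".-\/" inside [...] is a range from "." to "/", not three literal characters.
my $range_class   = qr/^[\w\s,.-\/]$/;
# Escaping the hyphen (or moving it to the end of the class) makes it literal.
my $literal_class = qr/^[\w\s,.\-\/]$/;

print "range class matches '-':   ", class_matches($range_class,   '-'), "\n";  # 0
print "literal class matches '-': ", class_matches($literal_class, '-'), "\n";  # 1
```

That would explain why a description such as "V-Notch Trowel" trips up the original class.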

Nevertheless your pattern leaves me with some questions before I offer an easier solution.

1) Why do you explicitly look for the letter "i"?

2) Does your line always end with a double quote followed by a semicolon, or is the semicolon optional? I am going to assume that the quote is always present and the semicolon is optional.

3) Does your pattern always start with "s.prop1="? If so, it is a good idea to start the regexp with the "^" anchor to improve performance. Likewise, if your pattern always ends with the quote/semicolon, use the "$" anchor. And if your regexp is in a loop, you can use the "o" modifier, which tells Perl to compile the regexp only once instead of once per line of your data file. This only makes a difference when the pattern interpolates a variable, since a literal pattern like this one is compiled just once anyway. I have included all these in the solution, but you can remove them if you need to.


The solution avoids hard-coding special characters. You can put them back in if you want the regexp to be more restrictive.

So try this. Let me know if you want to tweak the solution with additional rules.

if ($line =~ /^s\.prop1="[\w\W]+";?$/o)

By the way, here is the meaning of these classes:

\w = [A-Za-z0-9_] (word characters)
\W = any character not matched by \w
\s = white-space characters such as blanks, tabs, linefeeds, carriage returns and others. I removed this from my solution as "\w\W" is all I needed.
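
If you want to check these classes for yourself, something like the following should do. This is only a sketch; the sub name shorthand_checks and the test characters are my own choices for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [description, result] pairs,
# where result is 1 if the test character matched the pattern, else 0.
sub shorthand_checks {
    my @checks = (
        [ '\w matches a digit',           '7',  qr/^\w\z/ ],
        [ '\w matches an underscore',     '_',  qr/^\w\z/ ],
        [ '\W matches a semicolon',       ';',  qr/^\W\z/ ],
        [ '\s matches a carriage return', "\r", qr/^\s\z/ ],
        [ '[\w\W] matches a newline',     "\n", qr/^[\w\W]\z/ ],
    );
    return map { [ $_->[0], ($_->[1] =~ $_->[2] ? 1 : 0) ] } @checks;
}

# Print one line per check, for example "\w matches a digit: 1".
printf "%s: %d\n", @$_ for shorthand_checks();
```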
 
The i was a special addition thanks to working using putty from a library, since I am denied, temporarily, from using OpenBSD. Too many 15 minute sessions!


Yes I need white space and the line always ends in ";

I will give these changes a roll.
I didn't think of the /o option, good idea.
 
I just realized that the character class "[\w\W]" is nearly the same as "." (i.e. any character). The one difference is that "." does not match "\n" unless you add the /s modifier.

So try this instead:

if ($line =~ /^s\.prop1=".+";$/o)

I removed the ending "?" since you said that the string always ends with a quote and a semicolon.

As a general rule /o is safe to use unless your pattern contains a variable that changes value between calls. It only pays off when the pattern interpolates a variable, though, because a literal pattern is compiled once in any case. I also don't think there is any benefit if the regexp is in a subroutine that gets called repeatedly.
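
For a pattern that is built from a variable, another way to get compile-once behaviour is to precompile it with qr// and reuse the result. The sketch below assumes that is what you want. The sub names build_prop_regex and count_matching_lines and the sample lines are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: compile the pattern once for a given property name.
sub build_prop_regex {
    my ($prop_name) = @_;
    # \Q...\E quotes any regex metacharacters in the interpolated name.
    return qr/s\.\Q$prop_name\E="(.+?)";/;
}

# Count the lines that match an already compiled pattern.
sub count_matching_lines {
    my ($regex, @lines) = @_;
    my $count = 0;
    for my $line (@lines) {
        $count++ if $line =~ $regex;
    }
    return $count;
}

my $prop1_re = build_prop_regex('prop1');
my @sample = (
    's.prop1="Some Item description";',
    's.prop2="100056508";',
    's.prop1="Some other item description";',
);
print count_matching_lines($prop1_re, @sample), "\n";   # prints 2
```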

 
Well, the ^ and $ don't work.
Not getting any lines with either of them, so some of these lines must lack a newline where I expected one.

/o is good though

Using /s\.prop1="([\w\s,.-\/"'+\(\)&%#=:]+)";/o right now, but it is still not right.
I am going to look at a few more pages; perhaps I can use the next "line" to catch the right piece, because ";/o doesn't do the trick when I include " and ;
 
If you are missing a newline character then your string occurs in the middle of some other text. Hence the "^" and "$" won't work.

If you can identify and show me the lines that are giving you problems, I can help you figure out why. Perhaps there are leading or trailing spaces or maybe "prop1" is in upper case.

Just out of curiosity, are you running from Unix or Windows? Likewise what environment is your data coming from?
 
HA, I think I found the problem.
I wget'ed a page and used vim.
It is filled with a mix of \n and \r's (not on same spot, though)

So these lines have a \r screwing things up.
I have been getting rid of these after a match, to avoid database problems, first time I have had a match problem from them, though.
Question: Does Perl treat \r and \n the same, or should I just add \r to my regex instead of $?
 
Oh well, still not hitting it, keep getting stuff after \r
 
\n and \r are two different characters. You can look for them explicitly, but your \s includes both among others (as I stated before).

Keep one thing in mind regarding the dot operator "." and "\n". By default the "." matches any character except "\n". This is for consistency with other Unix commands from the past (sed, grep, etc.). To override this default use the "/s" modifier (i.e. m//s) which allows "." to match "\n" by treating multi-line text as a Single line string.
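
Here is a small sketch of both points, using made-up sample strings and a throwaway helper called matches. The first pair shows "$" failing when a "\r" sits in front of the final "\n". The second pair shows "." refusing to cross a "\n" unless /s is given:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text matches $regex, else 0.
sub matches {
    my ($text, $regex) = @_;
    return $text =~ $regex ? 1 : 0;
}

my $crlf_line = qq{s.prop1="Stuff here";\r\n};   # line ending in \r\n
my $two_lines = qq{s.prop1="Stuff\nhere";};      # description with an embedded \n

# "$" matches at the end of the string or just before a final "\n",
# so the "\r" in between makes the plain anchor fail.
print matches($crlf_line, qr/";$/),    "\n";   # 0
print matches($crlf_line, qr/";\r?$/), "\n";   # 1

# "." does not match "\n" unless the /s modifier is used.
print matches($two_lines, qr/s\.prop1="(.+)";/),  "\n";   # 0
print matches($two_lines, qr/s\.prop1="(.+)";/s), "\n";   # 1
```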
 
Can you paste a handful of items from the web page so we have a better idea of how all the lines/items are represented in the text?

For example, does a new item always begin with '<cr>s.prop1' where <cr> could be some combination of linefeeds and carriage returns? What other data is between one description and the next?

For example, if the text is something like:
Code:
s.prop1="Some Item description";
s.prop1="Some other item description";
At this point, I think we're guessing at what else is in your input data.
 
OK, here is a section from one page, with this clipped out of vi

Code:
        <script type="text/javascript"> s_account="homedepot"</script>^M^M<div style="display:none">^M<script type="text/java
script" src="http://www.homedepot.com/wcsstore/hdus/scripts/s_code.js"></script>^M<!-- SiteCatalyst code version: H.5.Copyrig
ht 1997-2006 Omniture, Inc. More info available at http://www.omniture.com -->^M<script language="JavaScript"><!--^Ms.eVar26=
h_s_eVar26;^Ms.eVar27=h_s_eVar27;^Ms.eVar28=h_s_eVar28;^Ms.eVar29=h_s_eVar29;^Ms.eVar30=h_s_eVar30;^Ms.referrer=h_s_referrer;
^Ms.prop4=decodeURIComponent(getURLParam("searchRedirect"));^Ms.pageName="productDetails";^Ms.prop28="productDetails";^Ms.cha
nnel="FlooringTile & Stone";^Ms.prop1="le Cutter, for Tiles up to 13 In., with two 7_8 In. Titanium-Coated Tungsten Carbide C
utting Wheels";^Ms.prop2="100056508";^Ms.hier1="Flooring>Tile & Stone>Tile Tools & Accessories>Tile Tools>Hand Tools>13 In. M
anual Tile Cutter, for Tiles up to 13 In., with two 7_8 In. Titanium-Coated Tungsten Carbide Cutting Wheels";^Ms.events="prod
View,event16,event10";^Ms.eVar4="";^M
 var locStoreNbr = readCookie("THD_LOCSTORE");
 
I think the safest approach would be to convert all of your \rs to \ns before you try to do any matching or processing, as long as there is no reason why you need to preserve the line endings in their original format.
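
A minimal sketch of that idea, assuming the whole page is already in one string. The sub names normalize_line_endings and prop1_descriptions and the sample data are made up for illustration. Note that this version also collapses "\r\n" pairs into a single "\n", which goes slightly beyond a plain \r to \n swap:

```perl
use strict;
use warnings;

# Turn every "\r\n" pair or lone "\r" into a single "\n".
sub normalize_line_endings {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}

# Hypothetical helper: return the description from every s.prop1 line.
sub prop1_descriptions {
    my ($text) = @_;
    my @found;
    for my $line (split /\n/, normalize_line_endings($text)) {
        # With clean line endings, the ^ and $ anchors work as expected.
        push @found, $1 if $line =~ /^s\.prop1="(.+)";$/;
    }
    return @found;
}

my $raw = qq{s.pageName="productDetails";\rs.prop1="Some Item description";\r\ns.prop2="100056508";\r};
print "$_\n" for prop1_descriptions($raw);   # prints: Some Item description
```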

Annihilannic.
 
It looks like you have some embedded "\r" in your text represented with "^M" by your editor. Keep in mind that if you cut and paste this text, the "^M" will be pasted as the characters "^" and "M". So make sure you are not dealing with those characters in your data.

I assume you only want "prop1" lines not any other type of "prop" lines that you have.

As suggested, you may want to clean up your data by converting all the "\r" to "\n" or "\r\n".


Nevertheless I got this pattern:

if ($line =~ /\s*s\.prop1="[\w\s;,.\-\/"'+()&%#=:]+";?\s*/o)

to match this line:

$line=qq/\rs.prop1="Economy Tile Installation Kit; One V-Notch Trowel, One Square Notch Trowel, One Grout Squeegee";\r/;

by using "\s" in the regexp to match the leading and trailing "\r" in the string.
 
A possibly easier to understand regex would be something like this:
Code:
m/s\.prop1="(.+?)";/s;
If there is always a line ending (shown as ^M) after the end of the 's.prop1="...";' you could modify it a bit to something similar to:
Code:
m/s\.prop1="(.+?)";[\x0a\x0d]+/s;
It is worth noting that using non-greedy matches does give you a bit of a performance hit, but hopefully it won't be too noticeable.
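
To show the difference between the greedy and non-greedy versions, here is a small sketch run against made-up data in the same shape as the page above, with "\r" between statements. The sub names extract_greedy and extract_lazy are invented for this example:

```perl
use strict;
use warnings;

# Greedy: ".+" runs to the end of the text, then backs up to the LAST '";'.
sub extract_greedy {
    my ($text) = @_;
    return $text =~ /s\.prop1="(.+)";/s ? $1 : undef;
}

# Non-greedy: ".+?" stops at the FIRST '";' after the opening quote.
sub extract_lazy {
    my ($text) = @_;
    return $text =~ /s\.prop1="(.+?)";/s ? $1 : undef;
}

my $data = qq{s.channel="Flooring";\rs.prop1="Economy Tile Installation Kit; One V-Notch Trowel";\rs.prop2="100056508";\r};

# Make the carriage returns visible before printing.
(my $greedy = extract_greedy($data)) =~ s/\r/<CR>/g;
print "greedy: $greedy\n";
print "lazy:   ", extract_lazy($data), "\n";
```

The greedy capture drags in the following s.prop2 statement, which looks like the "stuff after \r" problem described earlier. The non-greedy one stops at the first closing '";'.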
 